Roundup of the Latest Research in AI and Machine Learning (Mid-May 2026)

Published: 2026-05-18 11:24

Artificial Intelligence (cs.AI)

【1】Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search

Title: Prospective Multi-Pathogen Disease Forecasting Based on Autonomous LLM-Guided Tree Search

Link: https://arxiv.org/abs/2605.16238

Authors: Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, Zahra Shamsi

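Summary: Probabilistic forecasting of infectious diseases matters for public health, but existing methods depend on expert modeling teams doing labor-intensive manual model curation. This bespoke development model hinders scaling to fine geographic resolutions or to emerging pathogens. The study presents an autonomous system that uses LLM-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered multiple methodologically distinct models for influenza, COVID-19, and respiratory syncytial virus (RSV). An ensemble built by aggregating these machine-generated models consistently matched or outperformed the gold-standard, human-curated ensembles of the US Centers for Disease Control and Prevention (CDC) out of sample. The system also handled the data-scarce "cold start" scenario for RSV. In addition, controlled retrospective ablations show that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop mechanism ensures structural fidelity to complex scientific theories. By automatically translating epidemiological theory into accurate, transparent code, the framework overcomes the modeling labor bottleneck and enables rapid deployment of expert-level disease forecasting at unprecedented scale.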

Abstract: Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.
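The abstract mentions two ingredients that are easy to picture in code: aggregating several models into an ensemble, and scoring forecasts with a log-scale distance. The following is a minimal sketch only. The abstract does not say which aggregation rule or which metric the authors use, so the quantile-wise median, the `log1p` transform, and the names `median_ensemble` and `log_abs_error` are all assumptions made for illustration, and the forecast numbers are made up.

```python
import math
from statistics import median


def log_abs_error(pred: float, obs: float) -> float:
    """Absolute error on the log(1 + x) scale (an assumed metric, not the paper's)."""
    return abs(math.log1p(pred) - math.log1p(obs))


def median_ensemble(model_quantiles: list[dict[float, float]]) -> dict[float, float]:
    """Combine per-model quantile forecasts by taking the median at each quantile level.

    Assumes every model reports the same quantile levels.
    """
    levels = sorted(model_quantiles[0])
    return {q: median(m[q] for m in model_quantiles) for q in levels}


# Three hypothetical models forecasting weekly admissions (illustrative values).
forecasts = [
    {0.25: 80.0, 0.5: 100.0, 0.75: 130.0},
    {0.25: 90.0, 0.5: 120.0, 0.75: 160.0},
    {0.25: 60.0, 0.5: 95.0, 0.75: 125.0},
]
ensemble = median_ensemble(forecasts)

# On the log scale, a 2x miss costs about the same at small and large counts.
small_miss = log_abs_error(10, 20)
large_miss = log_abs_error(1000, 2000)
```

One property of a log-scale distance is that it penalizes relative rather than absolute error, so low-count and high-count locations contribute comparably. Whether that is the mechanism behind the reward-hacking result reported in the abstract is not stated there.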

【2】FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Title: FORGE: Self-Evolving Agent Memory Without Weight Updates via Population Broadcast

Link: https://arxiv.org/abs/2605.16233

Authors: Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

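Summary: Can LLM agents improve their decision-making through self-generated memory, with no gradient updates? The study proposes FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, in which a dedicated reflection agent (using the same underlying LLM, with no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed). An outer loop propagates the memory of the best-performing instance to the population between stages and freezes converged instances through a graduation criterion. The evaluation uses CybORG CAGE-2, a stochastic network-defense POMDP with a 30-step horizon against the B-line attacker, on which all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) show strongly negative, heavy-tailed zero-shot rewards. Compared with a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion across all 12 model-representation conditions, and reduces the major-failure rate (return below -100) to as low as about 1%. The findings: (1) population broadcast is the critical mechanism; a no-graduation ablation confirms that broadcast carries the performance gains, while graduation mainly saves compute; (2) Examples gives the strongest returns for three of the four models, while Rules offers the best cost-reliability trade-off with about 40% fewer tokens; (3) weaker baseline models benefit disproportionately, suggesting that FORGE may narrow capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; the cross-family findings are directional evidence only.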

Abstract: Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below -100) to as low as ~1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with ~40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
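The staged structure described in the abstract (a per-instance inner loop, then a broadcast of the best instance's memory, then graduation) can be sketched as a toy loop. This is not the authors' implementation: `Instance`, `run_stage`, `toy_reflect`, `toy_evaluate`, and the threshold of -20 are hypothetical stand-ins, the "reflection agent" is a deterministic function rather than an LLM, and the scores are invented so the example runs without any environment.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    memory: list = field(default_factory=list)  # prompt-injected notes
    score: float = float("-inf")                # last evaluation return
    graduated: bool = False                     # frozen once converged


def run_stage(population, reflect, evaluate, threshold):
    """One stage: inner-loop update per instance, then broadcast and graduation."""
    # Inner-loop stand-in: each active instance appends one note to its own memory.
    for idx, inst in enumerate(population):
        if not inst.graduated:
            inst.memory = inst.memory + [reflect(idx, inst.memory)]
            inst.score = evaluate(inst.memory)
    # Outer loop: copy the best instance's memory to every active instance.
    best = max(population, key=lambda inst: inst.score)
    for inst in population:
        if not inst.graduated:
            inst.memory = list(best.memory)
            inst.score = best.score
            # Assumed graduation rule: freeze once the score clears a threshold.
            inst.graduated = inst.score >= threshold
    return best


def toy_reflect(idx, memory):
    # Hypothetical: only instance 0 happens to produce useful notes.
    return "good" if idx == 0 else "noise"


def toy_evaluate(memory):
    # Hypothetical return: starts at -100, each useful note adds 30.
    return -100.0 + 30.0 * memory.count("good")


population = [Instance() for _ in range(3)]
for _ in range(3):
    run_stage(population, toy_reflect, toy_evaluate, threshold=-20.0)
```

In this toy setting the instances that never produce a useful note still end up with the full useful memory, because each stage overwrites their memory with the best one. That mirrors, in a very reduced form, the abstract's claim that broadcast rather than graduation carries the gains.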

【3】Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Title: Fully Open Meditron: An Auditable Pipeline for Clinical Large Language Models

Link: https://arxiv.org/abs/2605.16215

Authors: Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley

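Summary: Clinical decision support systems require traceable, auditable pipelines to enable rigorous, reproducible validation. Yet current LLM-based clinical decision support systems remain largely opaque. Most "open" models are open-weight only, releasing their parameters while withholding the data that determines model behavior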
