The AI-driven scientific revolution: agents usher in a new era of research

Published: 2026-05-01 22:49

Artificial intelligence (AI) advocates are betting that ‘AI agents’ are the application of this technology that will affect society the most. Agentic AI involves using a large language model (LLM) to carry out multi-step tasks, by connecting it to external tools such as Internet browsers or coding suites. The hope is that AI assistants can be created that simplify real-world tasks. In science, some think that AI agents — perhaps even several working together — will not just save time, but also eventually run their own experiments and generate knowledge.

But this dream is not yet a reality. Although access to AI agents is already being sold by technology firms, many such agents are either limited in scope or exist in beta versions that require significant human oversight. Because they are based on LLMs, which are, at heart, statistical prediction machines, they are prone to making mistakes known as hallucinations. In a trial earlier this year by Anthropic in San Francisco, California, to see whether its agent Claudius could run a vending-machine-based shop, the agent conjured up fake bank account details and sold some items at a loss.

Nature spoke to researchers developing, evaluating and using AI agents to find out how scientists can make use of the bots and mitigate the risks.

Researchers already use automated tools, for example, citation managers that organize and format references, and workflow packages that process and analyse data. But AI agents are different. Rather than follow prescribed instructions for each task, agents use LLMs to make and refine plans on the fly for a variety of multi-step goals. Unlike lone LLMs, they also harness tools to take actions in the real world — for instance, to write and run code or navigate databases — with some interacting with each other and using working memory to remember user preferences and previous actions.
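The loop described above (plan, call a tool, record the result, plan again) can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the `Agent` class, the `scripted_planner` function and the tiny `search` corpus are all hypothetical, and the planner is a hand-written stand-in for what would be an LLM call in a real agent.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    """Toy agent: a planner picks the next tool call until it decides to stop."""
    planner: Callable[[str, list], tuple]  # hypothetical stand-in for an LLM call
    tools: dict                            # tool name -> callable
    memory: list = field(default_factory=list)  # working memory of past actions

    def run(self, goal: str, max_steps: int = 5) -> list:
        for _ in range(max_steps):
            # Re-plan on every step, using everything observed so far.
            tool_name, argument = self.planner(goal, self.memory)
            if tool_name == "finish":
                break
            result = self.tools[tool_name](argument)
            self.memory.append((tool_name, argument, result))
        return self.memory


def scripted_planner(goal, memory):
    """Hypothetical planner: search first, then count the hits, then stop."""
    if not memory:
        return ("search", goal)
    if len(memory) == 1:
        hits = memory[0][2]  # result of the earlier search step
        return ("count", hits)
    return ("finish", None)


def search(query):
    """Toy 'database' lookup over an invented three-item corpus."""
    corpus = ["agent benchmarks", "protein folding", "agent safety"]
    return [doc for doc in corpus if query in doc]


agent = Agent(planner=scripted_planner, tools={"search": search, "count": len})
trace = agent.run("agent")
```

The point of the sketch is the structure: the sequence of steps is not fixed in advance, because the planner sees the working memory before choosing each action. Swapping the scripted planner for a real model changes the quality of the decisions, not the shape of the loop.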

Streamlining everyday research tasks is one goal. “In my group, every PhD student now has their own AI agent that effectively serves as a research assistant,” says Marinka Zitnik, a researcher in biomedical informatics at Harvard University in Boston, Massachusetts. These home-made agents help Zitnik’s team to perform low-stakes tasks, such as curating data sets, turning text into tables and writing certain pieces of code, she says.

One appealing application of agents lies in using them to emulate the collaboration of several researchers with different expertise. An example is the AI ‘tumour board’ being developed by Microsoft. In this case, agents, each with access to different data sets and training, interact to mimic the deliberations of the multidisciplinary team that determines an individual treatment plan for a person with cancer. Because tumour boards are usually formed only for patients with the most complicated cases, using health-care agents to assist clinicians could allow personalized care to be provided for more people, says Ece Kamar, who leads the AI Frontiers laboratory at Microsoft Research, based in Redmond, Washington. (In a statement in May, Microsoft said that its health-care AI models were intended for research use and were not to be deployed in clinical settings “as-is”.)

Whether agents can do science themselves is an idea that excites researchers, but the answer remains unclear. Google and many other firms and academic groups have developed ‘co-scientist’ agents, which generate hypotheses by looking for hidden insights in existing data. The co-scientist uses multiple agents to, for example, evolve and improve ideas, or test them against each other.

Zitnik and her colleagues are also exploiting an agent’s ability to access live data and ‘reason’ on the basis of those data, in drug discovery. In unpublished work, they used an AI agent to tap into and analyse data on clinical trials, on adverse effects and in regulatory documents, to look for drugs that have protective effects against diseases they were not prescribed for. They found, for example, that people with diabetes who were given dapagliflozin had a lower incidence of Alzheimer’s disease later in life than did those who were not prescribed it. The team is also running in silico ‘clinical trials’ using electronic health records to test hypotheses, she says.

For simple uses such as literature reviews, agentic AI already exists as packages that “anybody can use”, says Doug Downey, an AI researcher at the Allen Institute for Artificial Intelligence (Ai2) in Seattle, Washington. Although more advanced systems require machine-learning expertise, some researchers are trying to democratize access. Zitnik and her colleagues are developing ToolUniverse, an open online environment that allows researchers to connect LLMs to commonly used tools in different scientific domains, using only natural language commands. This should “make AI agents more broadly accessible to other fields and scientists who do not write code”, she says.

The ultimate agent — one that can get anything done autonomously in a reliable way — is “almost an artificial general intelligence problem”, says Kamar. “We are far from having those agents.” But researchers are attempting to benchmark how well agents perform now.

The benchmarking tool AstaBench, developed by Ai2, measures how well agents can perform 2,400 scientific tasks. It shows that although agents such as Ai2’s own Asta v0 perform relatively well on tasks such as literature reviews, they are more likely to struggle on harder challenges such as data analysis, as well as attempts to complete the entire scientific workflow, from designing computer-based experiments to producing a report.

“The ability to string together a bunch of successful steps like that is actually far beyond the current capabilities of existing agents,” says Downey. Although agents can do “startlingly intelligent things”, they also get hung up on some things a human would find easy.

Evaluating an AI co-scientist is difficult — its failures are not always obvious, and how ‘good’ a hypothesis is can be subjective. “Ultimately, the only definitive way to test an AI-generated scientific hypothesis is through experimental evaluation,” says Zitnik.

One risk, as with any LLM-based product, is that agents will waste users’ time by getting things wrong. Even in a task such as a literature review, there is a risk that an LLM will paraphrase text in a way that means it doesn’t accurately represent the literature, says Downey.

And reports of more serious issues have emerged in the business world, for example, when a coding agent violated an order to stop generating new code and deleted a company’s database. Such instances happen when AI agents try to reach their goals but don’t understand what actions are appropriate, says Kamar. “If we don’t have the right guard-rails in place to really guide what an agent should or should not do, it is easy for these agents to actually do things that may surprise us,” she says.

Getting an AI to refer to original sources, as well as to explain its actions and check in with humans at each step of a process, can help to avoid hallucinations and “significantly bump up” agent performance, says Kamar. Another way to protect against agents going awry is to keep them in ‘containers’ that limit the actions they can take or information they can access; for example, by making them unable to delete files, she says.
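One simple way to picture the ‘container’ idea is an allow-list placed between the agent and its tools. The sketch below is an assumption-laden illustration: `RestrictedToolbox`, `ActionBlocked` and the in-memory `files` dictionary are hypothetical names invented here, and a real deployment would rely on operating-system or platform-level isolation rather than a Python wrapper alone.

```python
class ActionBlocked(Exception):
    """Raised when an agent requests an action outside its allow-list."""


class RestrictedToolbox:
    """Hypothetical guard-rail: only explicitly allowed tools can be called."""

    def __init__(self, tools, allowed):
        self._tools = dict(tools)
        self._allowed = frozenset(allowed)
        self.log = []  # audit trail of every request, allowed or not

    def call(self, name, *args):
        if name not in self._allowed:
            self.log.append(("blocked", name, args))
            raise ActionBlocked(f"action {name!r} is not permitted")
        self.log.append(("allowed", name, args))
        return self._tools[name](*args)


# An in-memory stand-in for a file system, so the example has no side effects.
files = {"notes.txt": "draft results"}

toolbox = RestrictedToolbox(
    tools={
        "read_file": lambda path: files[path],
        "delete_file": lambda path: files.pop(path),
    },
    allowed={"read_file"},  # deletion exists as a tool but is never permitted
)

content = toolbox.call("read_file", "notes.txt")
try:
    toolbox.call("delete_file", "notes.txt")
    deleted = True
except ActionBlocked:
    deleted = False
```

Here the read succeeds, the delete request is refused and logged, and the file is still present afterwards. Keeping a log of blocked requests also gives a human reviewer something concrete to check, in the spirit of the oversight Kamar describes.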
