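AI Frontier Breakthroughs: Looped Models, Superhuman Persuasion, and a New Paradigm for Physical Intelligence
LG - Machine Learning CV - Computer Vision CL - Computation and Language AI - Artificial Intelligence RO - Robotics
1. [LG] Looped World Models 2. [AI] AI systems out-persuade expert humans 3. [AI] Kairos: A Native World Model Stack for Physical AI 4. [CL] GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? 5. [RO] Human Universal Grasping
Summary: looped world models; AI systems that out-persuade human experts; a native world model stack built for Physical AI; whether agents can build playable games end-to-end in a real game engine; human universal grasping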
H A Lu, Z.L. V Wei, Q Zhang, J Zeng… [FaceMind Research Asia]
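Looped World Models
Key points:
Main idea: Resolve the fundamental tension that current world models face in long-horizon, high-fidelity simulation: computation is expensive, and rollout errors readily compound exponentially with the number of steps. The paper proposes Looped World Models (LoopWM), which introduce parameter-shared loop iteration, adaptive depth, and delayed decoding to achieve highly stable and efficient long-horizon simulation of dynamic environments at very small parameter cost.
Innovations:
Contributions:
Improvements:
Limitations:
Takeaways:
One-sentence summary: The paper proposes Looped World Models (LoopWM), which combine a parameter-shared looped architecture with adaptive computation depth and add a spectral stability constraint and a counterintuitive "delayed decoding" mechanism; this addresses compounding error in long-horizon simulation, reaches up to 100× parameter efficiency with only 1 billion parameters, beats hundred-billion-parameter models on multi-step world modeling tasks, and establishes "iterative latent depth" as a new way to scale world models without loss.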
Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100× parameter efficiency over conventional approaches, with adaptive computation that automatically scales depth to match the complexity of each prediction step. Orthogonal to scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which might significantly push the community forward.
https://arxiv.org/abs/2606.18208
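The looped refinement described in the LoopWM abstract above (one parameter-shared block applied repeatedly, with depth adapted per prediction step) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the paper's architecture: `shared_block` is a hypothetical damped tanh update rather than a transformer block, and the halting rule (stop once the state changes by less than `tol`) is just one simple way to adapt depth.

```python
import math


def shared_block(state, weights):
    """One pass through a single shared block (hypothetical stand-in for a transformer block).

    The same `weights` are reused on every loop iteration, so extra depth adds no parameters.
    """
    n = len(state)
    out = []
    for i in range(n):
        pre = sum(weights[i][j] * state[j] for j in range(n))
        # Damped residual update: keep half of the old state, move halfway to tanh(W s).
        out.append(0.5 * state[i] + 0.5 * math.tanh(pre))
    return out


def looped_refine(state, weights, max_loops=64, tol=1e-4):
    """Refine a latent state by looping the shared block until it stops changing.

    Returns (refined_state, loops_used). The tolerance-based halting rule is an
    assumption of this sketch, standing in for a learned adaptive-depth mechanism.
    """
    for step in range(1, max_loops + 1):
        new_state = shared_block(state, weights)
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:
            return state, step
    return state, max_loops


# Toy weights with small norm, so the repeated update is a contraction.
W = [[0.2, -0.1], [0.1, 0.3]]
hard_state, hard_loops = looped_refine([1.0, -1.0], W)
easy_state, easy_loops = looped_refine([0.001, 0.001], W)
print(hard_loops, easy_loops)
```

In this toy setting an input that starts far from the fixed point uses more loop iterations than one that starts close to it, which is the intuition behind letting depth track the difficulty of each step.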
K Hackenburg, C Wagner, L Hewitt, B M. Tappin… [University of Oxford & UK AI Security Institute & Stanford University]
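AI systems out-persuade expert humans
Key points:
Main idea: This paper examines whether frontier conversational AI can outperform rigorously trained, highly motivated human experts in realistic, high-stakes persuasive conversations. Across four preregistered large-scale experiments comprising 18,978 conversations, it finds that AI systems outperformed human persuasion experts, including world championship debaters and experienced fundraisers, both in shifting political attitudes on complex policies and in eliciting real charitable donations.
Innovations:
Contributions:
Improvements:
Limitations:
Takeaways:
One-sentence summary: Through four rigorous large-scale experiments, the paper shows that frontier AI decisively beat top human experts, including world championship debaters and experienced professional fundraisers, at persuasion tasks such as changing attitudes and eliciting real donations; counterintuitively, AI's dominance comes not from inscrutable emotional manipulation but from sheer information throughput and factual density, and even experts coached by the AI could not overcome their own physiological bandwidth limits to close the gap, which signals an irreversible reshaping of how public opinion and behavior are influenced.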
Many societal decisions are settled by contests of persuasion. Conversational AI is a powerful new entrant in these contests [14, 23], but whether it can out-persuade skilled and highly incentivized humans has remained unclear. Here, in a series of four preregistered experiments (n = 18,978 conversations from 6,923 people), we pitted AI systems against a range of human persuaders, including laypeople, winners of a separately preregistered four-round online persuasion tournament, professional canvassers, and world championship debaters. We found that AI systems were reliably more persuasive than expert humans, even when expert humans chose their issues, researched in advance, underwent hours of live, structured practice, and were incentivized with £1,000 cash bonuses. In a follow-up study, AI’s advantage persisted after experts received a coaching tool that let them practice against the AI that beat them, review their performance history, and see what AI would have said at key moments. We found converging evidence that AI’s advantage stemmed from rapidly deploying larger quantities of information: after coaching, expert humans could tie an AI constrained to respond at human speeds and with human-length messages. In a final study, we show that AI’s advantage extends to consequential real-world behavior: AI was nearly 3x more effective than professional canvassers from a UK fundraising firm at raising real-money donations to Save the Children. Together, these results establish that frontier AI systems out-persuade expert humans in conversation, with significant implications for political communication.
https://arxiv.org/abs/2606.16475
K T F Wang, S You, Q Zhang, T Huang… [Kairos Team]
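Kairos: A Native World Model Stack for Physical AI
Key points:
Main idea: This paper aims to turn the "world model" from a mere passive video generator into infrastructure for Physical AI. It proposes the Kairos framework, which learns the regularities of the world through a unified data curriculum, persistently maintains world state through a hybrid linear memory architecture, and runs efficiently on edge devices through deployment-oriented hardware-software co-design, providing a native foundation for future physical agents to self-evolve in a closed observe-act-feedback loop.
Innovations:
Contributions:
Improvements:
Limitations:
Takeaways:
One-sentence summary: Kairos abandons the stitched-together post-hoc fine-tuning route, injects physical causality through a native cross-embodiment data curriculum, and adopts a hybrid linear temporal memory architecture to address long-horizon collapse at its mathematical root, yielding a general-purpose native world model infrastructure that runs efficiently on consumer-grade GPUs and supports closed-loop self-evolution of physical agents.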
World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation–action–feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency–capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
https://arxiv.org/abs/2606.16533
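The three temporal components named in the Kairos abstract above (sliding-window attention, dilated sliding windows, gated linear attention) can be made concrete with a small sketch. This is not Kairos's implementation: the causal masking, the function names, and the single scalar `gate` are all assumptions chosen to show the general idea, namely that the two window masks limit which past positions are visible while the gated recurrence carries a fixed-size running memory of everything seen so far.

```python
def sliding_window_mask(seq_len, window):
    """Causal local mask: position i may attend to the `window` most recent positions (itself included)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def dilated_window_mask(seq_len, window, dilation):
    """Causal dilated mask: position i sees `window` positions spaced `dilation` steps apart."""
    return [
        [i - j >= 0 and (i - j) % dilation == 0 and (i - j) // dilation < window
         for j in range(seq_len)]
        for i in range(seq_len)
    ]


def gated_linear_memory(keys, values, queries, gate):
    """Gated linear attention as a recurrence (scalar gate is an assumption of this sketch).

    State update: S_t = gate * S_{t-1} + outer(k_t, v_t); output: o_t = q_t^T S_t.
    The state S has a fixed size, so memory cost does not grow with sequence length.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] = gate * state[a][b] + k[a] * v[b]
        outputs.append([sum(q[a] * state[a][b] for a in range(d_k)) for b in range(d_v)])
    return outputs


# Local window of 2 versus a dilated window of 3 taps spaced 2 apart.
print(sliding_window_mask(5, 2)[3])
print(dilated_window_mask(8, 3, 2)[6])
print(gated_linear_memory([[1.0], [1.0]], [[2.0], [4.0]], [[1.0], [1.0]], gate=0.5))
```

With the same number of attended positions, the dilated mask reaches further back than the local one, and the gated recurrence keeps a decayed summary of older steps that neither window covers.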
T Luo, R Wang, J Bi, C Xu… [The Chinese University of Hong Kong & Shenzhen Loop Area Institute]
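GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?
Key points:
Main idea: This paper examines whether current AI coding agents can turn natural-language requirements end-to-end into complete, playable interactive games inside a real game engine (Godot). To that end, the authors propose three core principles essential for evaluating game generation, build a new benchmark, GameCraft-Bench, on top of them, and systematically evaluate existing frontier large models on the basis of real interaction.
Innovations:
Contributions:
Improvements:
Limitations:
Takeaways:
One-sentence summary: The paper proposes three core criteria for evaluating AI game generation and builds GameCraft-Bench, a benchmark verified through interaction in a real engine; the evaluation reveals a counterintuitive state of affairs: although today's top AI agents can write complex code, a large and still unbridged gap remains when it comes to building complete, coherent, visually playable end-to-end interactive games.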
Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation.
https://arxiv.org/abs/2606.17861
K Y Wu, T Zhou, I Tu, B Yan… [New York University]
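Human Universal Grasping
Key points:
Main idea: This paper addresses the severe data bottleneck in dexterous grasping with multi-fingered robots. By collecting human grasping behavior in natural environments at scale from egocentric smart glasses, the authors build the 1M-HUGS dataset and train a flow-matching model, HUG. From a single RGB-D image, HUG predicts highly general human grasp poses, which can be retargeted directly to different multi-fingered robot hands with zero robot data to achieve general-purpose grasping in the real world.
Innovations:
Contributions:
Improvements:
Limitations:
Takeaways:
One-sentence summary: The paper presents HUG, a dexterous robot grasping framework that does away with robot-specific data entirely: using only 1 million frames of human grasping data collected with smart glasses in everyday environments, it combines a novel RGB-D flow-matching model that predicts human hand poses with zero-shot retargeting to robot hands, beats existing state-of-the-art baselines by a wide margin (+34%) on grasping complex real-world objects, and points to "learning directly from humans" as a new path for general-purpose Physical AI.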
Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGS, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-BENCH, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-BENCH across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website.
Keywords: Learning from Humans, Dexterous Grasping
Figure 1: HUG learns dexterous grasping without any robot data. Trained solely on egocentric human grasp data, HUG generates diverse human grasps for real-world objects in a single RGB-D image captured from a stereo camera, which can be retargeted to robot hands for zero-shot, in-the-wild dexterous grasping.
https://arxiv.org/abs/2606.17054
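The HUG abstract above describes a flow-matching model that outputs a grasp vector (wrist translation, wrist rotation, MANO hand pose). As a rough illustration of how sampling from such a model generally works, the sketch below integrates a velocity field from a starting point at t = 0 to a sample at t = 1 with Euler steps. It is a generic toy, not HUG's code: `make_toy_velocity` is a hypothetical oracle field that points at one fixed target along a straight-line path, whereas a real model would use a learned network conditioned on the RGB-D image, and the 4-dimensional vector merely stands in for the full grasp parameterization.

```python
def euler_sample(velocity, x0, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical oracle velocity for a straight-line path toward one known target.

    For x_t = (1 - t) * x0 + t * target, the velocity needed at (x_t, t) is
    (target - x_t) / (1 - t). A trained model would predict this from data instead.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Toy "grasp" vector; the values and dimensionality are made up for illustration.
target_grasp = [0.1, -0.2, 0.3, 0.5]
start = [0.0, 0.0, 0.0, 0.0]  # in practice this would be drawn from a noise distribution
sample = euler_sample(make_toy_velocity(target_grasp), start)
print(sample)
```

Drawing different starting points and integrating a learned, image-conditioned field is what would let a model of this kind produce diverse grasps for the same object.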