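爱可可 AI Frontier Digest (5.12)
LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AI - Artificial Intelligence
1. [LG] On Training in Imagination
2. [LG] Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
3. [CL] Fast Byte Latent Transformer
4. [LG] On the Invariance and Generality of Neural Scaling Laws
5. [AI] Uneven Evolution of Cognition Across Generations of Generative AI Models
Summary: this issue covers training in imagination, spectral dynamics in deep networks, the fast Byte Latent Transformer, the invariance and generality of neural scaling laws, and the uneven evolution of cognitive abilities across generations of generative AI.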
N Timor, R Shwartz-Ziv, M Goldblum, Y LeCun… [Weizmann Institute of Science & New York University & Columbia University]
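On Training in Imagination
Key points:
Main idea: This paper analyzes the theoretical foundations and practical mechanics of the "training in imagination" paradigm in model-based reinforcement learning (MBRL). It addresses three questions: how errors in the dynamics model and the reward model quantitatively affect policy return; how to split a limited budget optimally between dynamics samples and reward-labeled samples; and how much noisy or biased reward signal policy-gradient optimization (REINFORCE) tolerates, and which tradeoffs follow.
Innovation:
Contribution:
Improvement:
Limitations:
Takeaways:
One-sentence summary: By separating dynamics error from reward error in the world model and combining the result with neural scaling laws, the paper derives a closed-form optimal data-budget allocation and shows, counterintuitively, that although the global theoretical bound is very loose, the ratio structure it yields is highly accurate locally, which gives cost control and noise-robust optimization for training in imagination a mathematical basis.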
State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and are scored by a learned reward model, but without querying the true environment during policy updates. We study this training paradigm by quantifying how errors in learned dynamics and reward models affect returns and policy optimization. First, we extend the analysis of Asadi et al. [2018b] to MDPs with learned reward models, and derive the optimal sample allocation—the ratio of dynamics samples to reward samples that minimizes a bound on return error under power-law scaling assumptions. We identify lower Lipschitz constants of the learned dynamics, reward, and policy as a representation desideratum that tightens this bound, and we connect this perspective to the temporal-straightening objective of Wang et al. [2026]. Second, we examine how policy optimization with REINFORCE tolerates noisy rewards, which are often cheaper to obtain. We show that zero-mean reward noise leaves the gradient estimator unbiased and adds at most a variance term that decreases with the number of rollouts. This introduces a practical tradeoff: given a fixed budget, should one buy more rollouts with cheaper but noisier rewards, or fewer rollouts with more expensive but less noisy rewards? We reduce this choice to a one-dimensional optimization problem and characterize the optimum.
https://arxiv.org/abs/2605.06732
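The abstract's claim that zero-mean reward noise leaves the REINFORCE gradient estimator unbiased can be checked numerically on a toy problem. The sketch below is not the paper's setup: it assumes a stateless softmax bandit, Gaussian reward noise, and hypothetical helper names (`exact_gradient`, `reinforce_estimate`) chosen for illustration only.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def exact_gradient(theta, rewards):
    """Exact gradient of the expected reward for a softmax bandit policy."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

def reinforce_estimate(theta, rewards, n_rollouts, noise_std, rng):
    """Monte Carlo REINFORCE estimate with zero-mean Gaussian reward noise."""
    pi = softmax(theta)
    actions = rng.choice(len(pi), size=n_rollouts, p=pi)
    # Assumption: noise is additive, zero-mean, and independent of the action.
    noisy_r = rewards[actions] + rng.normal(0.0, noise_std, size=n_rollouts)
    # For a softmax policy, grad log pi(a) = onehot(a) - pi.
    score = np.eye(len(pi))[actions] - pi
    return (noisy_r[:, None] * score).mean(axis=0)
```

With many rollouts, the noisy estimate should approach `exact_gradient`; with few rollouts it stays unbiased but scatters more widely, which is the variance term the abstract says shrinks with the number of rollouts.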
C Lauditi, C Pehlevan, B Bordelon [Harvard University]
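Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
Key points:
Main idea: This paper studies how the spectra (singular values/eigenvalues) of hidden-layer weight matrices evolve while wide neural networks are trained with (stochastic) gradient descent. It asks how, in the feature-learning regime, weight matrices move from a purely random initial state to structured directions (outliers) that carry task-specific knowledge, and uses the answer to explain why certain parameterizations (such as µP) allow hyperparameters to transfer across model sizes.
Innovation:
Contribution:
Improvement:
Limitations:
Takeaways:
One-sentence summary: The paper proposes a two-level dynamical mean-field theory that tracks the spectral evolution of deep network weights, shows mechanistically that µP scaling achieves learning-rate transfer across widths by stabilizing spectral outliers, and finds, counterintuitively, that tasks with very many outputs, such as large language modeling, restructure the entire random spectral bulk and so go beyond the classical "bulk + outlier" picture.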
We study the evolution of hidden-weight spectra in wide neural networks trained by (stochastic) gradient descent. We develop a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. We apply this framework to two settings: (1) infinite-width nonlinear networks in mean-field/µP scaling and (2) deep linear networks in the proportional high-dimensional limit, where width, input dimension, and sample size diverge with fixed ratios. Our theory predicts how outliers evolve with training time, width, output scale, and initialization variance. In deep linear networks, µP yields width-consistent outlier dynamics and hyperparameter transfer, including width-stable growth of the leading NTK mode toward the edge of stability (EoS). In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, despite converging to a stable large-width limit. We show that this bulk+outlier picture is descriptive of simple tasks with small output channels, but that tasks involving large numbers of outputs (ImageNet classification or GPT language modeling) are better described by a restructuring of the spectral bulk. We develop a toy model with extensive output channels that recapitulates this phenomenon and show that the edge of the spectrum still converges for sufficiently wide networks.
https://arxiv.org/abs/2605.07870
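For readers unfamiliar with the "bulk + outlier" picture mentioned above, a static toy example may help. The sketch below is not the paper's DMFT: it assumes a square Gaussian matrix with entry variance 1/n plus a rank-one spike whose directions are independent of the bulk (the paper explicitly treats the harder dependent case), and it uses the classical random-matrix result that a spike of strength theta > 1 produces a top singular value near theta + 1/theta while the bulk edge stays near 2.

```python
import numpy as np

def spiked_singular_values(n, theta, seed=0):
    """Singular values of an n x n Gaussian bulk plus a rank-one spike."""
    rng = np.random.default_rng(seed)
    # Bulk: i.i.d. entries with variance 1/n, so the bulk edge sits near 2.
    bulk = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    # Assumption: spike directions are random and independent of the bulk.
    u = rng.normal(size=n)
    u /= np.linalg.norm(u)
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)
    w = bulk + theta * np.outer(u, v)
    return np.linalg.svd(w, compute_uv=False)  # sorted in descending order

def predicted_outlier(theta):
    """Large-n location of the top singular value when theta > 1."""
    return theta + 1.0 / theta
```

Comparing `spiked_singular_values(n, theta)[0]` with `predicted_outlier(theta)` for growing `n` shows the outlier separating from the bulk; the second singular value stays close to the bulk edge.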
J Kallini, A Pagnoni, T Limisiewicz, G Ghosh, L Zettlemoyer, C Potts, X Han… [FAIR at Meta]
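Fast Byte Latent Transformer
Key points:
Main idea: This paper targets a pain point of byte-level language models (byte-level LMs): they are tokenizer-free but slow at inference. By bringing discrete text diffusion and speculative decoding built on the architecture itself into the hierarchical byte model (BLT), it sharply reduces the number of forward passes and the memory-bandwidth cost at inference, enabling parallel, accelerated byte-level generation.
Innovation:
Contribution:
Improvement:
Limitations:
Takeaways:
One-sentence summary: By adding block-wise discrete diffusion and auxiliary-free self-speculative decoding to the hierarchical byte-level language model (BLT), the paper can cut the estimated memory-bandwidth cost of byte-level generation by more than 50% while keeping generation quality in check, removing a key efficiency barrier to the practical use of tokenizer-free models.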
Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
https://arxiv.org/abs/2605.08044
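The draft-then-verify idea behind BLT-S and BLT-DV can be illustrated with a generic greedy speculative-decoding loop. The sketch below is not the BLT implementation: `target_next` and `draft_next` are hypothetical stand-ins for the full model and the cheap drafter, and the verification loop simulates what a real model would score in a single parallel forward pass.

```python
from typing import Callable, List, Tuple

NextByte = Callable[[List[int]], int]

def greedy_decode(target_next: NextByte, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one full-model call per generated byte."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(target_next: NextByte, draft_next: NextByte,
                       prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k bytes cheaply, then keep the prefix the target model agrees with."""
    out = list(prompt)
    full_passes = 0
    while len(out) - len(prompt) < n_new:
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # One simulated full-model pass verifies all k draft positions.
        full_passes += 1
        accepted: List[int] = []
        for i in range(k):
            t = target_next(out + draft[:i])
            accepted.append(t)  # confirms draft[i] or replaces it
            if t != draft[i]:
                break  # later draft bytes were conditioned on a wrong byte
        out.extend(accepted)
    return out[:len(prompt) + n_new], full_passes
```

Because every accepted byte is the target model's own greedy choice, the output matches plain greedy decoding; the saving comes from needing fewer full-model passes whenever the drafter is right more often than not.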
X Han, Z Liu, S Saria, P P Liang [Johns Hopkins University & MIT]
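On the Invariance and Generality of Neural Scaling Laws
Key points:
Main idea: This paper addresses the high cost of fitting neural scaling laws (NSLs) for new models and tasks. It proposes a theoretical framework based on "information resolution" so that a scaling law fit on a well-resourced source domain (such as general text) can be reliably transferred to new domains where compute or data is limited (such as medical text or noisy time series), without rerunning expensive hyperparameter and scale sweeps.
Innovation:
Contribution:
Improvement:
Limitations:
Takeaways:
One-sentence summary: By introducing "information resolution" to quantify data transformations and degradation, the paper builds a neural scaling law that is invariant under bijective transformations and remains predictive when data is degraded, which allows model performance in new domains to be predicted at very low cost and supports the rule that lower data quality implies a smaller optimal model size.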
Neural scaling laws establish a predictable relationship between model performance and data or compute, offering crucial guidance for resource allocation in new domains and tasks. Yet such laws are most needed precisely where they are hardest to obtain: fitting one for a new model–task pair demands expensive sweeps that typically exhaust the very compute budget the law is meant to economize. This paper poses the research question of how to develop generalizable scaling laws: laws fit once on a well-resourced source domain and reliably transported to new domains where running a full sweep is infeasible, which requires a fundamental understanding of when and why scaling properties change. We address this by identifying the right invariants: scaling laws are preserved under bijective (information-preserving) transformations of the data and modified in predictable, information-theoretically grounded ways under non-bijective transformations that lower its information resolution ρ: a single axis along which a law fit in one domain can be transported to another. We validate this across language, vision, and speech, and demonstrate two cross-domain applications: predicting scaling for language models trained on electronic health records from laws fit on general text, and predicting time-series classification scaling under varying levels of noise injection, recovering the data-scaling exponents to within 3% error.
https://arxiv.org/abs/2605.07546
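As background for the "data-scaling exponents" mentioned in the abstract, here is a minimal sketch of how such an exponent is commonly estimated. It assumes the simplest power-law form, loss = a * D^(-b) with no irreducible-loss term, which may differ from the functional form the paper fits; the function name `fit_power_law` is hypothetical.

```python
import numpy as np

def fit_power_law(dataset_sizes, losses):
    """Fit loss = a * D**(-b) by linear regression in log-log space."""
    log_d = np.log(np.asarray(dataset_sizes, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(log_d, log_l, 1)
    a = float(np.exp(intercept))
    b = float(-slope)  # the data-scaling exponent
    return a, b

def relative_error(estimate, truth):
    """Relative error, e.g. for checking an exponent against a reference."""
    return abs(estimate - truth) / abs(truth)
```

Fitting on a source domain and comparing the recovered `b` with an exponent measured in a target domain via `relative_error` mirrors, in a very simplified way, the kind of check behind the abstract's "within 3% error" statement.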
I Galatzer-Levy, D McDuff, X Liu, J McGiffin [Google DeepMind & Google Research]
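Uneven Evolution of Cognition Across Generations of Generative AI Models
Key points:
Main idea: This paper asks whether the underlying cognitive abilities of current generative models have developed broadly and evenly on the path toward AGI. Using a traditional psychological test (WAIS-IV) and a newly developed AI-specific IQ benchmark (AIQ) whose difficulty can be scaled without limit, the study quantifies how several generations of large models have progressed in core cognitive domains such as language, working memory, and visual reasoning, and uses the results to examine the limitations of current deep learning architectures.
Innovation:
Contribution:
Improvement:
Limitations:
Takeaways:
One-sentence summary: Using the AIQ psychometric benchmark, which extends beyond human-normed limits, the paper shows that the cognitive development of current multimodal models is severely uneven (verbal ability near ceiling, visual reasoning near floor), finds, counterintuitively, that reasoning tasks with the same logic perform very differently across modalities, and argues that scaling alone cannot overcome the architectural bottleneck of lacking an "implicit world model" on the way to true AGI.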
The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted from the Wechsler Adult Intelligence Scale revealed a profoundly uneven cognitive architecture: near-ceiling performance in verbal comprehension and working memory (>98th percentile) contrasted with near-floor performance in perceptual reasoning (<1st percentile). To track developmental trajectories beyond human-normed limits, we developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations and two model families, revealing significant but asymmetric performance gains. Notably, we uncovered a sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format, indicating an architectural bias towards language-based symbolic manipulation. While abstract visual reasoning improved, visual-perceptual organization remained largely stagnant. Collectively, these findings demonstrate that the cognitive abilities of generative models are evolving unevenly, suggesting that scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations in achieving balanced, human-like general intelligence.
https://arxiv.org/abs/2605.06815
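To put the percentile figures in the abstract in context, the sketch below converts a Wechsler-style standard score to a percentile. It assumes the usual norming of WAIS composite scores (mean 100, standard deviation 15) and a normal distribution; how the paper itself maps model responses to scores is not described here, and the function name is hypothetical.

```python
import math

def standard_score_to_percentile(score, mean=100.0, sd=15.0):
    """Percentile of a standard score under a normal norming distribution."""
    z = (score - mean) / sd
    # Normal CDF expressed through the error function, scaled to 0..100.
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Under these assumptions, the 98th and 1st percentiles sit a little more than two standard deviations above and below the human mean, which gives a sense of how wide the gap between the models' verbal and perceptual results is.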