New MIT Research: Adding Noise to Large Models Can Replace GRPO/PPO Tuning
量子位· 2026-03-16 06:11
Core Viewpoint
- A new paper from MIT suggests that by simply adding Gaussian noise to pre-trained models, performance can match or even exceed that of traditional tuning algorithms like GRPO/PPO, thus simplifying the tuning process significantly [1][3][7].

Group 1: Findings on Pre-trained Models
- The paper reveals that expert models already exist within the weight space of pre-trained models, described as a "Neural Thicket" phenomenon, where small perturbations can uncover task-specific experts [6][9][26].
- The authors propose a method called RandOpt, which involves adding Gaussian noise to large language models and ensembling the results, achieving comparable or superior performance on various tasks without complex tuning [7][35].
- Larger models exhibit better performance because denser regions of effective perturbations surround their weights, making it easier to find task-specific improvements [8][16][17].

Group 2: Mechanism of RandOpt
- RandOpt operates in two simple steps: randomly perturbing the model parameters to find "expert" versions, then using a voting mechanism to determine the best output from these models [28][32].
- The method allows for testing different noise strengths to ensure a variety of expert types are identified, and it can run multiple models simultaneously on different GPUs for efficiency [33][34].
- Initial results indicate that RandOpt achieves accuracy similar to or higher than mainstream tuning methods across various tasks, including language and vision-language models [35][38].

Group 3: Implications and Limitations
- The research emphasizes the need for high-quality pre-training, as the effectiveness of RandOpt relies on the model's initial training data [58].
- While RandOpt can enhance performance on specific tasks, it cannot enable the model to learn new skills beyond its pre-trained capabilities [58].
- The approach is best suited for tasks with clear answers, such as structured generation tasks, and may require further refinement for more complex tasks [59].
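The two-step mechanism described above (one-shot Gaussian perturbation, then majority voting over the perturbed copies) can be sketched in a toy form. This is a minimal illustration, not the paper's implementation: the real method perturbs LLM weights and votes over generated answers, whereas here a hypothetical "model" is just a list of float weights whose "answer" is a thresholded dot product.

```python
import random
from collections import Counter

def perturb(weights, sigma, rng):
    """Step 1: one-shot Gaussian perturbation of the weights.
    No iterations, no learning rate, no gradients."""
    return [w + rng.gauss(0.0, sigma) for w in weights]

def toy_predict(weights, x):
    """Hypothetical stand-in for a model's output on input x."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0

def randopt_vote(base_weights, x, sigma=0.01, n_experts=16, seed=0):
    """Step 2: ensemble the perturbed copies by majority vote."""
    rng = random.Random(seed)
    votes = [toy_predict(perturb(base_weights, sigma, rng), x)
             for _ in range(n_experts)]
    return Counter(votes).most_common(1)[0][0]

base = [0.5, -0.2, 0.1]   # pre-trained weights (toy)
x = [1.0, 1.0, 1.0]
print(randopt_vote(base, x))
```

In a real deployment, each perturbed copy could run on its own GPU and `sigma` would be swept across a few values to surface different kinds of "experts", as the summary notes.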
Is RL Dead in Post-Training? New MIT Algorithm Challenges Conventional Post-Training Thinking, Shared by Saining Xie
机器之心· 2026-03-15 06:00
机器之心 Editorial Team

This finding has disruptive implications for how we understand the parameter space of large models. As early as 2001, Schmidhuber et al. argued that "random guessing" cannot count as an effective learning algorithm, holding that "good solutions must be extremely sparse in weight space." However, Gan and Isola's research reveals a counterintuitive phenomenon: after pre-training, an LLM's weight space actually forms a dense "Neural Thicket," a state in which simple random sampling can discover effective solutions.

The paper points out that a pre-trained model is not merely the "starting point" for post-training: a large number of task experts already lie hidden within its weight space. As model scale grows, the density of these experts in weight space increases sharply, enough for random perturbation and ensembling to effectively capture superior solutions.

Building on this theory, RandOpt's procedure is extremely simple: add single-step Gaussian noise to a pre-trained model (no iterations, learning rates, or gradient computations required) and ensemble multiple perturbed copies of the model. Experiments show that this minimal operation alone lets the model match, or even exceed, the performance of traditional post-training methods such as PPO or GRPO on complex tasks like mathematical reasoning and code generation.

In current LLM development, the post-training stage is usually viewed as the key step for endowing models with specific capabilities. The traditional view holds that models ...