Agentic RL
2026: The Second Half of Large Model Training Belongs to the "Reinforcement Learning Cloud"
机器之心· 2026-01-12 05:01
Editor | Panda

At the end of 2024, break rooms in both Silicon Valley and Beijing were buzzing about the same unsettling topic: the Scaling Law seemed to be hitting a wall. At the time, even though NVIDIA's stock price was still soaring, multiple sources indicated that the new generation of flagship models, including the then highly anticipated Orion (the originally planned GPT-5), failed to deliver the expected marginal gains after simply scaling up parameters and training data. In addition, some research argued that the data needed for pretraining would soon be exhausted, even predicting a concrete time point: 2028.

(Figure from the paper arXiv:2211.04325v2)

Ilya Sutskever, co-founder of OpenAI and Safe Superintelligence Inc, left a pointed verdict at the time: "The 2010s were the age of scaling; now we are back in the age of wonder and discovery." Many read the remark as a pessimistic warning that the pretraining route of simply stacking compute and data had probably hit its ceiling.

It was not until early 2025 that a string of surprises broke the stalemate. By then, OpenAI's o1 model had pioneered reinforced reasoning a few months earlier, demonstrating a model's striking potential to trade thinking time for depth of intelligence and proving that test-time scaling is a path toward higher intelligence ...
AEPO: Agentic Entropy-Balanced Policy Optimization, for Steadier Exploration and Deeper Reasoning
机器之心· 2025-11-01 04:22
Core Insights
- The article discusses the development of Agentic Entropy-Balanced Policy Optimization (AEPO), a new algorithm aimed at balancing exploration and stability in multi-round reinforcement learning for intelligent agents [2][10][11].

Group 1: Algorithm Overview
- AEPO addresses the issues of "high-entropy rollout collapse" and "high-entropy gradient clipping" in existing Agentic RL methods, proposing two core mechanisms: dynamic entropy-balanced rollout sampling and entropy-balanced policy optimization [2][11] (a rough sketch of both mechanisms follows this section).
- The algorithm has shown significant improvements over seven mainstream reinforcement learning algorithms across 14 cross-domain benchmarks, particularly in deep search tasks [4][12].

Group 2: Performance Metrics
- AEPO achieved a Pass@5 score of 61.5% on deep search tasks, outperforming other methods such as ARPO and GRPO by an average of 5.8% [36][37] (see the Pass@k note below).
- The algorithm maintains training stability while enhancing sampling diversity and reasoning efficiency, providing a new optimization paradigm for scalable reinforcement training of general intelligent agents [4][12].

Group 3: Research Motivation
- The motivation behind AEPO is to find a balance in high-entropy environments, where excessive exploration can lead to instability and local optima [8][10].
- The research highlights the dual contradiction of high-entropy signals, which are necessary for exploration but can disrupt resource allocation and hinder learning [14][20].

Group 4: Future Directions
- Future research may expand AEPO to multi-modal inputs, complex tool ecosystems, and multi-agent reinforcement learning scenarios to enhance collaborative strategies and performance [41].
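The summary above describes AEPO's two mechanisms only at a high level, so the Python sketch below is a loose illustration of the idea rather than the authors' released implementation: the function names, thresholds, and the widened-clipping formulation are our own assumptions, chosen to show how one might (a) gate rollout branching by token entropy so a budget is not exhausted by consecutive high-entropy steps, and (b) avoid discarding gradients from high-entropy tokens during the policy update.

```python
import torch

# Hypothetical sketch of the two mechanisms attributed to AEPO above.
# All names and constants here are illustrative assumptions.

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution at each position."""
    probs = torch.softmax(logits, dim=-1)
    return -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)

def should_branch(step_entropy: float, running_mean: float,
                  budget_left: int, threshold: float = 1.5) -> bool:
    """Dynamic entropy-balanced rollout sampling (illustrative):
    spawn extra rollout branches at a step only if its entropy clearly
    exceeds a running baseline AND branching budget remains, so a run of
    high-entropy steps cannot collapse the whole sampling budget."""
    return budget_left > 0 and step_entropy > threshold * running_mean

def entropy_aware_clip_loss(ratio: torch.Tensor, advantage: torch.Tensor,
                            entropy: torch.Tensor, eps: float = 0.2,
                            extra: float = 0.1) -> torch.Tensor:
    """Entropy-balanced policy optimization (illustrative):
    widen the PPO-style clipping range for above-average-entropy tokens
    so their gradients are attenuated rather than clipped away outright."""
    high = entropy > entropy.mean()
    eps_t = torch.where(high,
                        torch.full_like(ratio, eps + extra),
                        torch.full_like(ratio, eps))
    clipped = torch.clamp(ratio, 1 - eps_t, 1 + eps_t)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```

The design intent mirrored here is simply the trade-off the article names: keep high-entropy tokens as exploration signal without letting them destabilize either the rollout budget or the gradient update.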
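For readers unfamiliar with the Pass@5 figure quoted above, the snippet below shows the standard unbiased Pass@k estimator (Chen et al., 2021). The AEPO paper's exact evaluation protocol may differ, and the sample counts in the example are invented purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k sampled answers
    is correct, given n samples of which c are correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 16 rollouts for one question, 6 of them correct.
print(round(pass_at_k(16, 6, 5), 2))  # ~0.94 for that question
# A benchmark-level Pass@5 such as the reported 61.5% averages this over all questions.
```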