Agentic RL
In 2026, the second half of large-model training belongs to the "reinforcement learning cloud"
机器之心· 2026-01-12 05:01
Core Insights
- The article discusses the transition in AI model development from scaling laws based on increasing parameters and training data to a focus on reinforcement learning (RL) and post-training scaling, indicating a paradigm shift in AI capabilities [1][4][10].

Group 1: Scaling Law and Model Development
- By the end of 2024, discussions in Silicon Valley and Beijing highlighted concerns that scaling laws were hitting a wall, as newer flagship models such as Orion did not show the expected marginal gains from added parameters and data [1].
- Ilya Sutskever's remark suggested a shift from an era of scaling to one of miracles and discoveries, signaling skepticism about the sustainability of the pre-training approach [3].
- By early 2025, OpenAI's o1 model had introduced reinforcement-learning-driven reasoning, demonstrating that test-time scaling could yield higher intelligence, while DeepSeek R1 successfully replicated this approach as an open-source release [4][6].

Group 2: Reinforcement Learning and Infrastructure
- The focus of computational power is shifting from pre-training scaling to post-training and test-time scaling, emphasizing deep reasoning capability over sheer parameter count [8].
- The emergence of DeepSeek R1 showed that deep reasoning driven by reinforcement learning matters more for model evolution than simply increasing parameters [4][6].
- The industry is calling for new computational infrastructure to support this shift toward dynamic exploration and reasoning, as existing cloud architectures struggle to meet these demands [11][12].

Group 3: Agentic RL and Its Implications
- Nine Chapters Cloud has positioned itself as a leader in defining "reinforcement learning cloud" infrastructure, which it views as essential for the evolving AI landscape [12][14].
- The Agentic RL platform, launched in mid-2025, is the first industrial-grade reinforcement learning cloud platform, significantly improving training efficiency and reducing costs [15][19].
- Agentic RL aims to evolve general models into expert models capable of complex decision-making and control, addressing real-world challenges across industries [20][22].

Group 4: Real-World Applications and Economic Impact
- The successful deployment of a large-scale AI center in Huangshan within 48 days exemplifies Nine Chapters Cloud's engineering capabilities and operational efficiency [41][43].
- The Huangshan model is projected to generate significant economic benefits, with an estimated increase of at least 200 million yuan in annual service-industry value [48].
- The integration of AI capabilities into urban management and tourism demonstrates the potential of AI infrastructure to drive economic growth and improve operational efficiency [50][51].

Group 5: Future Vision and Market Position
- Nine Chapters Cloud aims to establish itself as a key player in the independent AI cloud sector, advocating an open ecosystem that does not compete with its clients [54][60].
- The company emphasizes defining standards for next-generation infrastructure, moving beyond traditional cloud services toward enabling the rapid evolution of intelligent agents [63][66].
- The future of cloud computing is envisioned as an "evolution era," in which the focus shifts from merely providing computational resources to enhancing the capabilities of intelligent agents [68][69].
AEPO: Agentic Entropy-Balanced Policy Optimization, for steadier exploration and deeper reasoning!
机器之心· 2025-11-01 04:22
Core Insights
- The article discusses the development of Agentic Entropy-Balanced Policy Optimization (AEPO), a new algorithm aimed at balancing exploration and stability in multi-turn reinforcement learning for intelligent agents [2][10][11].

Group 1: Algorithm Overview
- AEPO addresses the issues of "high-entropy rollout collapse" and "high-entropy gradient clipping" in existing Agentic RL methods by proposing two core mechanisms: dynamic entropy-balanced rollout sampling and entropy-balanced policy optimization (a rough sketch of both ideas follows this list) [2][11].
- The algorithm shows significant improvements over seven mainstream reinforcement learning algorithms across 14 cross-domain benchmarks, particularly in deep search tasks [4][12].

Group 2: Performance Metrics
- AEPO achieved a Pass@5 score of 61.5% on deep search tasks, outperforming methods such as ARPO and GRPO by an average of 5.8% (the pass@k estimator behind this metric is illustrated after the code sketch below) [36][37].
- The algorithm maintains training stability while improving sampling diversity and reasoning efficiency, offering a new optimization paradigm for scalable reinforcement training of general intelligent agents [4][12].

Group 3: Research Motivation
- The motivation behind AEPO is to strike a balance in high-entropy environments, where excessive exploration can lead to instability and local optima [8][10].
- The research highlights the double-edged nature of high-entropy signals: they are necessary for exploration, but they can skew rollout resource allocation and hinder learning [14][20].

Group 4: Future Directions
- Future research may extend AEPO to multi-modal inputs, complex tool ecosystems, and multi-agent reinforcement learning scenarios to improve collaborative strategies and performance [41].
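The digest names AEPO's two mechanisms without giving their formulation. As a rough illustration only, the Python sketch below shows one plausible shape for each idea: rollouts that branch at high-entropy steps under a global branching budget (so high-entropy chains cannot monopolize sampling), and a PPO-style clipped objective whose clipping interval is relaxed in proportion to token entropy (so gradients at high-entropy tokens are not systematically clipped away). Every name and threshold here (toy_policy_logits, branch_threshold, entropy_relax, and so on) is a hypothetical placeholder, not the published AEPO implementation.

```python
# Illustrative sketch only: names, thresholds, and the exact update rule are
# assumptions for exposition, not the published AEPO implementation.
import numpy as np

rng = np.random.default_rng(0)

def toy_policy_logits(state: np.ndarray, vocab: int = 8) -> np.ndarray:
    """Stand-in for the agent's policy head: logits over `vocab` next actions."""
    return rng.normal(size=vocab) + 0.1 * float(state.sum())

def entropy(logits: np.ndarray) -> float:
    """Shannon entropy of the softmax distribution induced by the logits."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def rollout_with_entropy_budget(state, max_steps=6, branch_threshold=1.9,
                                entropy_budget=3):
    """Dynamic entropy-balanced rollout sampling (hypothetical form): branch an
    extra exploratory continuation at high-entropy steps, but cap the total
    number of branches so high-entropy chains cannot swallow the whole sampling
    budget (the 'high-entropy rollout collapse' failure mode)."""
    trajectories = [[state]]
    budget = entropy_budget
    for _ in range(max_steps):
        grown = []
        for traj in trajectories:
            s = traj[-1]
            logits = toy_policy_logits(s)
            a = int(np.argmax(logits + rng.gumbel(size=logits.shape)))
            grown.append(traj + [s + a])                      # main continuation
            if entropy(logits) > branch_threshold and budget > 0:
                budget -= 1                                   # spend branch budget
                a_alt = int(rng.integers(len(logits)))
                grown.append(traj + [s + a_alt])              # exploratory branch
        trajectories = grown
    return trajectories

def entropy_balanced_clipped_objective(ratio, advantage, token_entropy,
                                       clip_eps=0.2, entropy_relax=0.05):
    """Entropy-balanced policy optimization (hypothetical form): widen the
    PPO-style clipping interval in proportion to token entropy so updates at
    high-entropy tokens are not uniformly clipped to zero gradient."""
    eps = clip_eps + entropy_relax * token_entropy
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

if __name__ == "__main__":
    trajs = rollout_with_entropy_budget(np.zeros(4))
    print(f"sampled {len(trajs)} trajectories under the entropy budget")
    obj = entropy_balanced_clipped_objective(
        ratio=np.array([0.7, 1.4]), advantage=np.array([1.0, 1.0]),
        token_entropy=np.array([0.2, 2.5]))
    print("per-token clipped objective:", obj)
```

The point of both toy functions is that entropy modulates rather than gates behavior: branching and clipping both remain bounded, which is the exploration-versus-stability trade-off the summary describes.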
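For reading the Pass@5 figure: assuming the benchmark uses the standard unbiased pass@k estimator (as in Chen et al., 2021, "Evaluating Large Language Models Trained on Code"), the short snippet below shows how such a score would be computed from n sampled rollouts per query, of which c succeed. The digest does not state the evaluation protocol, so this is an interpretation, not a confirmed detail.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n attempts of which c are correct, succeeds."""
    if n - c < k:          # fewer than k incorrect attempts -> always passes
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 16 rollouts per query, 7 of them correct, evaluated at k = 5.
print(round(pass_at_k(n=16, c=7, k=5), 3))
```

Under that reading, a benchmark-level Pass@5 of 61.5% would be the mean of this per-query probability over all test queries.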