14B Beats 671B! Microsoft's rStar2-Agent Surpasses DeepSeek-R1 in Mathematical Reasoning

Core Viewpoint
- The article covers rStar2-Agent, an agentic reinforcement learning method from Microsoft Research that strengthens the reasoning capabilities of large language models (LLMs) and delivers frontier-level performance on mathematical reasoning tasks.

Group 1: Model Development and Innovations
- Rather than relying on test-time scaling alone (thinking longer), rStar2-Agent trains the model to think smarter: it interleaves advanced reasoning with tool interactions, such as executing code and reflecting on the results [1][2].
- Despite having only 14 billion parameters, the model matches or exceeds far larger models such as the 671-billion-parameter DeepSeek-R1 [2][25].
- The training infrastructure built for rStar2-Agent handles up to 45,000 concurrent tool calls with an average execution-feedback latency of just 0.3 seconds, substantially improving training throughput; a hedged sketch of such a dispatcher follows this summary [14][13].

Group 2: Training Methodology
- Training begins with a non-reasoning supervised fine-tuning (SFT) phase focused on general instruction following and tool usage, which avoids overfitting to a reasoning style and keeps initial responses short (see the training-loop skeleton below) [21][19].
- The GRPO-RoC method improves the efficiency of agentic reinforcement learning in the coding environment by tolerating environment noise and raising the quality of the training trajectories (see the filtering sketch below) [19][18].
- The model reaches state-of-the-art mathematical reasoning performance after only 510 reinforcement learning steps, an unusually efficient training run [23][25].

Group 3: Performance Metrics
- rStar2-Agent-14B scores 80.6% accuracy on the AIME24 benchmark, outperforming o3-mini, DeepSeek-R1, and Claude Opus 4.0 by 1.0, 0.8, and 3.6 percentage points respectively (implying roughly 79.6%, 79.8%, and 77.0% for those baselines) [26].
- Although trained almost entirely on mathematical tasks, the model generalizes well beyond mathematics [27].
- rStar2-Agent-14B also produces shorter average responses than the larger models, indicating a more token-efficient reasoning process [29].
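The article gives only two figures for the tool-call infrastructure: 45,000 concurrent calls and a 0.3-second average feedback latency. The sketch below is a minimal illustration of how a rollout engine might bound that concurrency with an async semaphore and per-call timeouts; `execute_in_sandbox`, the timeout value, and the other names here are assumptions for illustration, not Microsoft's implementation.

```python
import asyncio
import time

# Hypothetical sketch of a rollout-side tool-call dispatcher. The only
# figures taken from the article are the 45,000-call concurrency level
# and the ~0.3 s average feedback latency; everything else is assumed.

MAX_IN_FLIGHT = 45_000   # reported peak concurrent tool calls
TOOL_TIMEOUT_S = 10.0    # illustrative per-call timeout

async def execute_in_sandbox(code: str) -> str:
    """Stand-in for shipping `code` to an isolated execution worker."""
    await asyncio.sleep(0.3)  # mimics the ~0.3 s average feedback time
    return "ok"

async def tool_call(sem: asyncio.Semaphore, code: str) -> str:
    # The semaphore bounds in-flight calls; the timeout keeps one stuck
    # sandbox from stalling the rollout that issued the call.
    async with sem:
        try:
            return await asyncio.wait_for(execute_in_sandbox(code), TOOL_TIMEOUT_S)
        except asyncio.TimeoutError:
            return "error: tool call timed out"

async def main() -> None:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    t0 = time.perf_counter()
    results = await asyncio.gather(*(tool_call(sem, "print(1)") for _ in range(1_000)))
    print(f"{len(results)} tool calls in {time.perf_counter() - t0:.2f} s")

if __name__ == "__main__":
    asyncio.run(main())
```

Bounding concurrency at the dispatcher rather than at each worker is one common way to keep bursty tool traffic from overwhelming a shared execution pool.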
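To make the two-stage recipe concrete, here is a hypothetical skeleton: a short non-reasoning SFT pass followed by agentic RL. All functions are stubs, and only the stage ordering and the 510-step figure come from the article.

```python
# Hypothetical skeleton of the staged training recipe; the stub
# functions stand in for real training code.

def supervised_finetune(model: dict, examples: list[str]) -> None:
    """Stub for the short non-reasoning SFT phase: instruction following
    and tool-call formatting only, with no chain-of-thought targets."""
    model["sft_examples"] = len(examples)

def rl_step(model: dict) -> None:
    """Stub for one agentic RL step (rollout, filter, policy update)."""
    model["rl_steps"] = model.get("rl_steps", 0) + 1

model: dict = {}

# Stage 0: non-reasoning SFT keeps initial responses short and avoids
# baking a reasoning style into the model before RL begins.
supervised_finetune(model, ["<instruction demo>", "<tool-use demo>"])

# Stage 1+: agentic reinforcement learning; the article reports
# state-of-the-art math reasoning after only 510 RL steps.
for _ in range(510):
    rl_step(model)

print(model)  # {'sft_examples': 2, 'rl_steps': 510}
```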
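The article describes GRPO-RoC only at the level of its goal: filter environment noise out of the training trajectories. Reading "RoC" as resample-on-correct, the sketch below shows one plausible asymmetric filter: from an oversampled group, correct rollouts are kept only if they have few tool-call errors, incorrect rollouts are sampled uniformly to preserve diverse failure signal, and advantages are then computed group-relatively as in standard GRPO. The data model and the quality heuristic are assumptions, not the paper's exact procedure.

```python
import random
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical data model: one sampled trajectory with its final-answer
# reward and a count of noisy tool interactions (failed calls, malformed
# code blocks, and so on).
@dataclass
class Rollout:
    reward: float       # 1.0 if the final answer is correct, else 0.0
    tool_errors: int    # noisy tool interactions observed in the trace

def resample_on_correct(rollouts: list[Rollout], group_size: int) -> list[Rollout]:
    """Asymmetric downsampling of an oversampled group (sketch).

    Correct rollouts are ranked so only the cleanest ones (fewest
    tool-call errors) supply positive signal; incorrect rollouts are
    sampled uniformly to keep failure modes diverse.
    """
    correct = sorted((r for r in rollouts if r.reward > 0),
                     key=lambda r: r.tool_errors)
    incorrect = [r for r in rollouts if r.reward <= 0]
    half = group_size // 2
    group = correct[:half] + random.sample(incorrect, min(half, len(incorrect)))
    return group if group else rollouts[:group_size]

def group_relative_advantages(group: list[Rollout]) -> list[float]:
    """Standard GRPO advantage: reward normalized within the group."""
    rewards = [r.reward for r in group]
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0   # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Usage: oversample 2x the group size, filter, then compute advantages.
sampled = [Rollout(reward=float(random.random() > 0.5),
                   tool_errors=random.randint(0, 3)) for _ in range(32)]
group = resample_on_correct(sampled, group_size=16)
print(group_relative_advantages(group))
```

The asymmetry is the point: discarding messy-but-correct rollouts removes reward-hacking-style noise from the positive examples while leaving the negative signal untouched.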