14B Beats 671B: Microsoft's rStar2-Agent Surpasses DeepSeek-R1 in Mathematical Reasoning
36Ke · 2025-09-02 07:36
Core Insights
- The article discusses advances in the reasoning capabilities of large language models (LLMs) through test-time scaling and agentic reinforcement learning, highlighting the rStar2-Agent model developed by a Microsoft research team [1][2].

Group 1: Model Development
- Microsoft has developed a powerful agentic reinforcement learning method called rStar2-Agent, producing a 14-billion-parameter reasoning model that achieves state-of-the-art performance, surpassing even the 671-billion-parameter DeepSeek-R1 [2][17].
- The model delivers significant gains on mathematical reasoning tasks, reaching 80.6% accuracy on the AIME24 benchmark and outperforming several leading models [19].

Group 2: Innovations and Techniques
- The research team introduced three key innovations in rStar2-Agent:
1. A high-throughput, isolated code-execution environment capable of handling 45,000 concurrent tool calls with an average feedback latency of 0.3 seconds [10].
2. GRPO-RoC, a group relative policy optimization method that combines GRPO with a resample-on-correct rollout strategy to counter environment noise under sparse rewards [12][14].
3. A training scheme that raises a pre-trained 14-billion-parameter model to advanced reasoning capability with minimal computational resources [15][16].

Group 3: Performance Metrics
- The rStar2-Agent-14B model achieved strong results across reasoning benchmarks:
- 80.6% on AIME24, 69.8% on AIME25, and 52.7% on HMMT25, showing consistently high performance across tasks [19].
- It also outperformed DeepSeek-V3 on scientific reasoning benchmarks and posted competitive results on general alignment tests [22].
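The resample-on-correct idea behind GRPO-RoC can be sketched in a few lines: oversample rollouts, keep the cleanest correct traces plus a random sample of failures, then normalize rewards within the group. This is a minimal illustration, not the paper's exact recipe; the rollout fields (`reward`, `tool_errors`) and the half-and-half split are hypothetical assumptions.

```python
import random

def grpo_roc_downsample(rollouts, group_size, seed=0):
    """Resample-on-correct (sketch): from an oversampled batch of rollouts,
    keep the correct traces with the fewest tool-call errors plus a random
    sample of failed ones. Each rollout dict uses hypothetical keys:
    'reward' (1.0 correct, 0.0 not) and 'tool_errors' (failed tool calls)."""
    rng = random.Random(seed)
    correct = [r for r in rollouts if r["reward"] > 0]
    failed = [r for r in rollouts if r["reward"] <= 0]
    # Preferring low-error correct traces filters environment noise while
    # preserving the sparse positive reward signal.
    correct.sort(key=lambda r: r["tool_errors"])
    n_pos = min(group_size // 2, len(correct))
    n_neg = min(group_size - n_pos, len(failed))
    return correct[:n_pos] + rng.sample(failed, n_neg)

def group_relative_advantages(rewards):
    """Standard GRPO-style advantage: normalize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```

Usage: oversample (say, twice the group size), downsample with `grpo_roc_downsample`, then feed the kept rewards to `group_relative_advantages` for the policy update.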
Group 4: Broader Implications
- Despite being trained primarily on mathematical tasks, rStar2-Agent generalizes effectively, suggesting potential for broader applications in cognitive reasoning [21].