Agentic Reinforcement Learning

14B Beats 671B: Microsoft's rStar2-Agent Surpasses DeepSeek-R1 in Mathematical Reasoning

36Kr · 2025-09-02 07:36

Core Insights
- The article discusses advances in the reasoning capabilities of large language models (LLMs) through test-time scaling and agentic reinforcement learning, highlighting the rStar2-Agent model developed by a Microsoft research team [1][2].

Group 1: Model Development
- Microsoft has developed an agentic reinforcement learning method called rStar2-Agent and used it to train a 14-billion-parameter reasoning model that achieves state-of-the-art mathematical reasoning performance, surpassing even the 671-billion-parameter DeepSeek-R1 [2][17].
- The model shows significant improvements on mathematical reasoning tasks, reaching 80.6% accuracy on the AIME24 benchmark and outperforming several leading models [19].

Group 2: Innovations and Techniques
- The research team introduced three key innovations for rStar2-Agent:
  1. A high-throughput, independent code environment that handles 45,000 concurrent tool calls with an average execution feedback time of 0.3 seconds [10].
  2. GRPO-RoC, which combines Group Relative Policy Optimization (GRPO) with a Resample-on-Correct (RoC) rollout strategy to address environment noise under sparse rewards [12][14].
  3. A training scheme that takes a pre-trained 14-billion-parameter model to advanced reasoning capability with minimal computational resources [15][16].

Group 3: Performance Metrics
- rStar2-Agent-14B scored 80.6% on AIME24, 69.8% on AIME25, and 52.7% on HMMT25, showing consistently high performance across benchmarks [19].
- It also outperformed DeepSeek-V3 on scientific reasoning benchmarks and was competitive on general alignment tests [22].

Group 4: Broader Implications
- Although trained primarily on mathematical tasks, rStar2-Agent generalizes effectively, which suggests potential for broader reasoning applications [21].
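
The "group relative" part of GRPO, named under Group 2 above, can be illustrated with a short sketch. It shows the standard GRPO idea of scoring each rollout against the other rollouts sampled for the same prompt. It is not rStar2-Agent's actual training code: the function name, the example rewards, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    `rewards` holds one scalar per rollout sampled for the same prompt.
    Hypothetical helper; real implementations may differ in details
    such as the standard-deviation estimator.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight rollouts for one math problem, rewarded 1.0 only when the
# final answer is verified correct (a sparse, outcome-only reward).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# A group where every rollout fails carries no learning signal.
all_wrong = group_relative_advantages([0.0] * 8)
```

Correct rollouts receive positive advantages and incorrect ones negative advantages, while a group with identical rewards yields all zeros, which is one reason sparse rewards make the training signal noisy.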

14B Beats 671B! Microsoft's rStar2-Agent Surpasses DeepSeek-R1 in Mathematical Reasoning

机器之心 (Jiqizhixin) · 2025-09-02 01:27

Core Viewpoint
- The article discusses advances in large language models (LLMs) through rStar2-Agent, an agentic reinforcement learning method developed by Microsoft Research that improves reasoning capability and performance on mathematical reasoning tasks.

Group 1: Model Development and Innovations
- rStar2-Agent uses test-time scaling to strengthen reasoning, enabling longer and smarter thinking by integrating advanced cognitive abilities with tool interactions [1][2].
- The model has 14 billion parameters, yet it matches or exceeds much larger models such as DeepSeek-R1, which has 671 billion parameters [2][25].
- The training infrastructure built for rStar2-Agent handles 45,000 concurrent tool calls with an average execution feedback time of just 0.3 seconds, significantly improving training efficiency [14][13].

Group 2: Training Methodology
- The team introduced a training scheme that begins with a non-reasoning supervised fine-tuning (SFT) phase focused on general instruction following and tool usage, which helps avoid overfitting and keeps initial responses short [21][19].
- The GRPO-RoC method was used to make agentic reinforcement learning in the code environment more efficient, handling environment noise better and improving the quality of training trajectories [19][18].
- The model reached state-of-the-art mathematical reasoning performance in only 510 reinforcement learning steps, demonstrating exceptional training efficiency [23][25].

Group 3: Performance Metrics
- rStar2-Agent-14B achieved 80.6% accuracy on the AIME24 benchmark, outperforming o3-mini, DeepSeek-R1, and Claude Opus 4.0 by 1.0, 0.8, and 3.6 percentage points, respectively [26].
- The model generalized well beyond mathematics, despite being trained primarily on mathematical tasks [27].
- rStar2-Agent-14B produced shorter responses on average than larger models, indicating a more efficient reasoning process [29].
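
The trajectory-quality idea behind GRPO-RoC, mentioned under Group 2 above, can be pictured with a small sketch: sample more rollouts than needed, then keep a smaller group whose correct trajectories are the cleanest ones. The summary does not describe the selection rule, so everything here is an illustrative assumption rather than the published method: the `Rollout` fields, the `tool_errors` quality signal, and the proportional split between correct and incorrect rollouts.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    reward: float     # 1.0 if the final answer was verified correct, else 0.0
    tool_errors: int  # failed tool calls in the trajectory (hypothetical quality signal)

def resample_on_correct(rollouts, group_size, rng=None):
    """Downsample an oversampled rollout set to `group_size` (illustrative only).

    Incorrect rollouts are kept by uniform random sampling; correct
    rollouts are kept in order of fewest tool errors, so noisy but
    lucky trajectories are less likely to be reinforced.
    """
    if group_size > len(rollouts):
        raise ValueError("group_size must not exceed the number of rollouts")
    rng = rng or random.Random(0)
    correct = [r for r in rollouts if r.reward > 0]
    wrong = [r for r in rollouts if r.reward <= 0]
    # Preserve the observed correct/incorrect ratio as closely as possible.
    n_correct = min(len(correct), round(group_size * len(correct) / len(rollouts)))
    n_wrong = min(len(wrong), group_size - n_correct)
    kept_correct = sorted(correct, key=lambda r: r.tool_errors)[:n_correct]
    kept_wrong = rng.sample(wrong, n_wrong)
    return kept_correct + kept_wrong

# Sixteen oversampled rollouts: eight correct with 0..7 tool errors, eight incorrect.
rollouts = [Rollout(1.0, e) for e in range(8)] + [Rollout(0.0, e % 3) for e in range(8)]
kept = resample_on_correct(rollouts, group_size=8)
```

In this example the filter keeps the four correct rollouts with the fewest tool errors and four randomly chosen incorrect ones, so the retained group still contains failures to learn from.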