A New Breakthrough in Reinforcement Learning for Large Models: The SPO Paradigm Boosts LLM Reasoning!
机器之心 · 2025-06-08 08:21
Core Viewpoint
- The article discusses the potential of reinforcement learning (RL) to enhance the reasoning capabilities of large language models (LLMs), highlighting the effectiveness of models such as DeepSeek R1, Kimi K1.5, and Qwen 3 on complex reasoning tasks [1].

Current Challenges
- A fundamental challenge for effective RL is the credit assignment problem: attributing the final evaluation of an LLM's response to the specific decisions (tokens) within the generated sequence [2].
- The difficulty stems from sparse reward signals, which provide a clear success-or-failure judgment only at the end of the sequence [3].

Current Methods
- In RL, advantage estimation is the standard tool for credit assignment. Existing methods for LLMs fall into two categories according to the granularity at which advantages are estimated [5].
- Coarse-grained, trajectory-level methods such as GRPO (used in DeepSeek R1) compute a single advantage value from the final reward, so they can neither reward the correct parts of an incorrect answer nor penalize the redundant parts of a correct one [6].
- Fine-grained, token-level methods such as PPO estimate an advantage for every token, but they suffer from large estimation errors because trajectory distributions differ substantially across prompts and sampling during training is limited [6].

New SPO Framework
- A research team from the Chinese Academy of Sciences and City University of Hong Kong proposed the Segment Policy Optimization (SPO) framework to overcome these limitations [8].
- SPO adopts a medium-grained, segment-level advantage estimation approach: the generated sequence is divided into contiguous segments, and an advantage value is computed for each segment [11].

Advantages of SPO
- Improved credit assignment: segment-level advantages provide localized feedback, allowing the model to reward valuable parts of incorrect answers and penalize redundant segments in correct answers [12].
- More accurate advantage estimation: far fewer points need to be estimated than at the token level, so Monte Carlo sampling can be used to obtain unbiased advantage estimates without relying on an unstable critic model [12].
- Flexibility and adaptability: segments can be divided arbitrarily, so the granularity can be tuned anywhere between token level and trajectory level to suit different tasks and applications [12].

Core Components of SPO
- The SPO framework consists of three core components: a flexible segment division strategy, segment-level advantage estimation based on Monte Carlo sampling, and policy optimization using segment-level advantages [13].

Specific Instances of SPO
- The team proposed two instances of the framework: SPO-chain for short chain-of-thought scenarios and SPO-tree for long chain-of-thought scenarios, the latter improving the efficiency of Monte Carlo sampling [15].

Token Probability-Mask Strategy
- A token probability-mask strategy selectively computes the loss only on low-probability tokens within each segment, since these tokens are the critical decision points behind segment-level advantages (see the illustrative sketch after this section) [16].
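To make the mechanism concrete, here is a minimal, self-contained Python sketch of the general idea: split a sampled response into segments at low-probability "cutpoints", estimate each segment's advantage as the difference between Monte Carlo value estimates at its boundaries, and mask the loss so only low-probability tokens inside a segment receive gradient. This is an illustrative sketch under stated assumptions, not the authors' implementation; `rollout_fn`, the 0.5 probability threshold, the segment cap, and the simplified REINFORCE-style loss are all hypothetical choices.

```python
# Illustrative sketch only; interfaces such as `rollout_fn` are assumptions.
import math
from typing import Callable, List, Tuple


def split_at_cutpoints(token_logprobs: List[float],
                       threshold: float = math.log(0.5),
                       max_segments: int = 8) -> List[Tuple[int, int]]:
    """Split a sampled trajectory into contiguous segments at low-probability
    tokens ("cutpoints"), i.e. positions where the policy was uncertain."""
    cuts = [i for i, lp in enumerate(token_logprobs) if lp < threshold]
    cuts = cuts[:max_segments - 1]                    # cap the number of segments
    bounds, start = [], 0
    for c in cuts:
        bounds.append((start, c + 1))
        start = c + 1
    bounds.append((start, len(token_logprobs)))
    return [b for b in bounds if b[0] < b[1]]


def mc_value(prefix: List[int],
             rollout_fn: Callable[[List[int]], float],
             num_rollouts: int = 4) -> float:
    """Unbiased Monte Carlo value estimate of a prefix: sample a few completions
    from the current policy and average their final rewards (no critic model)."""
    return sum(rollout_fn(prefix) for _ in range(num_rollouts)) / num_rollouts


def segment_advantages(tokens: List[int],
                       segments: List[Tuple[int, int]],
                       rollout_fn: Callable[[List[int]], float]) -> List[float]:
    """Advantage of each segment = V(prefix up to segment end) - V(prefix up to
    segment start): credit is assigned per segment, not per whole trajectory."""
    values = [mc_value(tokens[:start], rollout_fn) for start, _ in segments]
    values.append(mc_value(tokens, rollout_fn))       # value after the last segment
    return [values[i + 1] - values[i] for i in range(len(segments))]


def masked_policy_loss(token_logprobs: List[float],
                       segments: List[Tuple[int, int]],
                       advantages: List[float],
                       mask_threshold: float = math.log(0.5)) -> float:
    """Simplified REINFORCE-style loss with a token probability mask: within each
    segment, only low-probability tokens (the pivotal decisions) receive the
    segment's advantage; high-probability tokens contribute no gradient."""
    loss = 0.0
    for (start, end), adv in zip(segments, advantages):
        for t in range(start, end):
            if token_logprobs[t] < mask_threshold:
                loss -= adv * token_logprobs[t]
    return loss
```

In a real training loop this masked term would be combined with the usual PPO-style importance ratios and clipping; the sketch only shows how segment boundaries, Monte Carlo values, and the probability mask fit together.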
Experimental Results
- In short chain-of-thought scenarios, models trained with SPO achieved higher accuracy than a range of baseline training algorithms [29].
- In long chain-of-thought scenarios, SPO-tree outperformed GRPO in accuracy with the same base model and training time [31].
- Among segment division strategies, the cutpoint-based method performed best in short chain-of-thought scenarios [36].

Conclusion
- The work presents SPO, an RL training framework built on medium-grained, segment-level advantage values that sits between token-level and trajectory-level methods, offering better credit assignment while requiring fewer estimation points [42].
- Experiments validate the effectiveness of the SPO framework and its two instances, SPO-chain and SPO-tree; a sketch of the tree-structured sampling idea behind SPO-tree follows below [43].
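As a closing illustration of why a tree-structured variant can make Monte Carlo estimation cheaper for long chains of thought, here is a hedged sketch in the spirit of SPO-tree (not the authors' code): rollouts branch at segment boundaries, and each leaf reward is reused in the value estimate of every ancestor prefix, so sibling segments can be compared without fresh independent rollouts. `generate_segment`, `final_reward`, and the branching/depth parameters are assumed interfaces for illustration.

```python
# Illustrative sketch only; `generate_segment` and `final_reward` are assumed.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    prefix: List[int]                        # tokens generated so far
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0                       # Monte Carlo value estimate of the prefix


def build_tree(root_prefix: List[int],
               generate_segment: Callable[[List[int]], List[int]],
               final_reward: Callable[[List[int]], float],
               branching: int = 2,
               depth: int = 3) -> Node:
    """Expand `branching` continuations per node for `depth` segment levels,
    then back up leaf rewards so every ancestor gets a value estimate from the
    same shared rollouts."""
    root = Node(list(root_prefix))

    def expand(node: Node, level: int) -> float:
        if level == depth:                   # leaf: score the completed response
            node.value = final_reward(node.prefix)
            return node.value
        total = 0.0
        for _ in range(branching):
            child = Node(node.prefix + generate_segment(node.prefix))
            node.children.append(child)
            total += expand(child, level + 1)
        node.value = total / branching       # value = mean reward of subtree leaves
        return node.value

    expand(root, 0)
    return root


def child_segment_advantages(node: Node) -> List[float]:
    """Advantage of the segment leading into each child: child value minus the
    parent's value, computed from shared rollouts rather than fresh samples."""
    return [child.value - node.value for child in node.children]
```

Because each sampled leaf reward serves the value estimates of all of its ancestors, fewer generations are needed per estimation point for long responses, which is consistent with the article's point that SPO-tree improves Monte Carlo sampling efficiency.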