Large-Model Reinforcement Learning
Large-Model Reinforcement Learning: Is DPO Still No Match for PPO?
自动驾驶之心· 2025-06-22 14:09
Core Insights
- The article discusses the theoretical and experimental shortcomings of DPO (Direct Preference Optimization) compared to PPO (Proximal Policy Optimization), noting that while DPO appears to lead on open-source benchmarks, top closed-source models such as GPT-4 and Claude rely on PPO [1][2].

DPO's Deficiencies
- DPO suffers from issues analogous to reward hacking: even though it has no explicit reward model, it can still converge to solutions that do not align with human preferences (see the DPO-loss sketch after this summary) [2].
- The theoretical analysis shows that, given true reward signals, the set of policies reachable by PPO is a proper subset of those reachable by DPO, meaning DPO can produce solutions that drift away from the reference policy [3].

Experimental Findings
- Experiments show that DPO can assign higher probabilities to responses not covered by the preference dataset, leading to unexpected behaviors, while PPO optimizes effectively under its KL constraint [6].
- DPO's performance can be improved by reducing distribution shift through methods like SafeSFT, but it still does not surpass PPO [8].

Performance Metrics
- Benchmark results consistently show PPO outperforming both iterative DPO and DPO across tasks, particularly in programming competitions [10].
- PPO-trained models reach up to 44.4% pass@5 on these benchmarks, while DPO-trained models struggle to achieve meaningful results [11][12].

Conclusion
- While DPO has theoretical merits, its practical value on high-stakes tasks such as programming is limited compared to PPO, which continues to set new performance standards [13].
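To make the contrast concrete, here is a minimal sketch of the standard DPO objective in PyTorch. It is not the paper's code: the tensor names are illustrative and the sequence log-probabilities are assumed to be precomputed. It shows that DPO's "reward" is implicit in the log-ratio against a frozen reference model, and that nothing in the loss constrains probability mass on responses outside the preference pairs, which is where the out-of-distribution behavior described above can arise.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: the reward is implicit in the log-ratio
    between the policy and a frozen reference model, so no separate
    reward model is trained (unlike PPO-based RLHF)."""
    # Implicit rewards: beta * log(pi_theta / pi_ref), one scalar per response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with precomputed sequence log-probabilities (one scalar per response)
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-11.1, -10.5])
ref_chosen = torch.tensor([-12.0, -10.0])
ref_rejected = torch.tensor([-11.0, -10.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Note that the loss only compares the two responses in each preference pair; responses the dataset never covers are left unconstrained, whereas PPO's KL term penalizes drift everywhere.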
A New Breakthrough in Large-Model Reinforcement Learning: The SPO Paradigm Boosts LLM Reasoning!
机器之心· 2025-06-08 08:21
Core Viewpoint
- The article discusses the potential of reinforcement learning (RL) for enhancing the reasoning capabilities of large language models (LLMs), highlighting the effectiveness of models like DeepSeek R1, Kimi K1.5, and Qwen 3 on complex reasoning tasks [1].

Current Challenges
- A fundamental challenge for effective RL is the credit assignment problem: attributing the final evaluation of an LLM's response to the specific decisions (tokens) within the sequence [2].
- The difficulty arises from sparse reward signals, which provide a clear success or failure signal only at the end of the sequence [3].

Current Methods
- In RL, advantage estimation is commonly used to address credit assignment. Existing methods for LLMs fall into two categories based on the granularity of the estimate [5].
- Coarse-grained trajectory-level methods, such as GRPO used in DeepSeek R1, compute a single advantage from the final reward, so they cannot reward the correct parts of an incorrect answer or penalize the redundant parts of a correct one [6].
- Fine-grained token-level methods, such as PPO, estimate an advantage for every token but suffer from high estimation error, because trajectory distributions differ substantially across prompts and sampling during training is limited [6].

New SPO Framework
- A research team from the Chinese Academy of Sciences and City University of Hong Kong proposed the Segment Policy Optimization (SPO) framework to overcome these limitations [8].
- SPO uses medium-grained, segment-level advantage estimation: generated sequences are divided into contiguous segments and an advantage is computed for each segment (a sketch follows this summary) [11].

Advantages of SPO
- Improved credit assignment: segment-level advantages provide localized feedback, allowing the model to reward valuable parts of incorrect answers and penalize redundant segments in correct answers [12].
- More accurate advantage estimation: fewer estimation points are needed, so Monte Carlo sampling can be used for unbiased advantage estimates without relying on an unstable critic model [12].
- Flexibility and adaptability: segment boundaries can be defined arbitrarily, allowing the granularity to be adjusted between token level and trajectory level to suit different tasks and applications [12].

Core Components of SPO
- The SPO framework consists of three core components: a flexible segment-division strategy, segment-level advantage estimation based on Monte Carlo sampling, and policy optimization using segment-level advantages [13].

Specific Instances of SPO
- The team proposed two instances of the framework: SPO-chain for short chain-of-thought scenarios and SPO-tree for long chain-of-thought scenarios, the latter improving Monte Carlo sampling efficiency [15].

Token Probability-Mask Strategy
- A token probability-mask strategy selectively computes the loss only on low-probability tokens within each segment, which are the critical decision points for segment-level advantages [16].

Experimental Results
- In short chain-of-thought scenarios, models trained with SPO achieved higher accuracy than a range of baseline training algorithms [29].
- In long chain-of-thought scenarios, SPO-tree outperformed GRPO in accuracy with the same base model and training time [31].
- The cutpoint-based segment-division method performed best in short chain-of-thought scenarios compared with the alternatives [36].

Conclusion
- The work presents SPO, an RL training framework based on medium-grained, segment-level advantages that sits between token-level and trajectory-level methods, offering better credit assignment while requiring fewer estimation points [42].
- The effectiveness of SPO and its two instances, SPO-chain and SPO-tree, is validated experimentally [43].
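The following is a minimal sketch of the segment-level advantage idea as summarized above, not the authors' implementation. The cutpoint threshold, the `value_of_prefix` estimator (standing in for Monte Carlo rollouts continued from a prefix), and the mask construction are all illustrative placeholders.

```python
import torch

def split_into_segments(token_logprobs, cutpoint_threshold=-1.5):
    """Divide a sampled response into contiguous segments, cutting after
    low-probability tokens (the 'cutpoints' described above).
    Returns a list of (start, end) index pairs."""
    lps = token_logprobs.tolist()
    cut_after = [i for i, lp in enumerate(lps) if lp < cutpoint_threshold]
    bounds = [0] + [c + 1 for c in cut_after] + [len(lps)]
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]

def segment_advantages(segments, value_of_prefix):
    """Segment advantage = estimated value after the segment minus before it.
    `value_of_prefix(i)` is a placeholder for a Monte Carlo estimate
    (e.g. mean reward of rollouts continued from the first i tokens)."""
    return [value_of_prefix(end) - value_of_prefix(start) for start, end in segments]

def spo_policy_loss(token_logprobs, segments, advantages, low_prob_mask):
    """Broadcast each segment's advantage to its tokens, keep only the tokens
    selected by the probability mask, and apply a REINFORCE-style loss."""
    loss = token_logprobs.new_zeros(())
    for (start, end), adv in zip(segments, advantages):
        loss = loss - adv * (token_logprobs[start:end] * low_prob_mask[start:end]).sum()
    return loss

# Toy usage with made-up numbers
logprobs = torch.tensor([-0.1, -2.0, -0.3, -0.2, -1.8, -0.1], requires_grad=True)
segs = split_into_segments(logprobs.detach())
advs = segment_advantages(segs, value_of_prefix=lambda i: 0.1 * i)  # dummy estimator
mask = (logprobs.detach() < -1.0).float()   # loss only on low-probability tokens
spo_policy_loss(logprobs, segs, advs, mask).backward()
```

The probability mask here mirrors the token probability-mask strategy: only the uncertain, decision-like tokens inside each segment receive gradient from that segment's advantage.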
Qwen & Tsinghua Team Overturns Conventional Wisdom: Large-Model RL on Only 20% of Key Tokens Beats Training on All Tokens
量子位· 2025-06-05 10:28
Core Insights
- The article discusses a recent result from the Qwen & Tsinghua LeapLab team, showing that training on only the top 20% of tokens by entropy significantly improves reinforcement learning for large models, outperforming training on all tokens [1][6].

Group 1: Research Findings
- The team set new state-of-the-art (SOTA) records with the Qwen3-32B model, scoring 63.5 on AIME'24 and 56.7 on AIME'25, the highest scores for models under 600B parameters trained directly from the base model [2].
- Extending the maximum response length from 20k to 29k raised the AIME'24 score to 68.1 [4].
- The results challenge the classic Pareto principle: in large-model reinforcement learning, the 80% of low-entropy tokens can be discarded without harm, and keeping them may even hurt [5][6].

Group 2: Token Analysis
- The study reveals a distinctive entropy distribution during chain-of-thought reasoning: over 50% of tokens have entropy below 0.01, while only 20% exceed 0.672 [9][10].
- High-entropy tokens act as "logical connectors" in reasoning, while low-entropy tokens are mostly deterministic completions such as word suffixes or parts of mathematical expressions [11].
- Experiments show that raising the sampling temperature at high-entropy tokens improves reasoning performance, while lowering it degrades performance, underscoring the importance of preserving high entropy at these critical positions [13].

Group 3: Training Methodology
- Restricting the reinforcement learning loss to the top 20% of high-entropy tokens gave Qwen3-32B significant gains: +7.71 points on AIME'24 and +11.04 on AIME'25, with average response length growing by roughly 1,378 tokens (a sketch of this token selection follows this summary) [15][17].
- Similar gains were observed for Qwen3-14B, while Qwen3-8B remained stable [16].
- Conversely, training on the 80% of low-entropy tokens sharply degraded performance, indicating they contribute little to reasoning ability [18].

Group 4: Implications and Generalization
- High-entropy tokens appear to enable exploration of different reasoning paths, whereas low-entropy tokens, being nearly deterministic, may restrict that exploration [20].
- The advantage of training on high-entropy tokens grows with model size, with the 32B model showing the largest improvement [22].
- Models trained on high-entropy tokens also performed well on out-of-domain tasks, suggesting a link between high-entropy tokens and generalization [22].

Group 5: Reinforcement Learning Insights
- Reinforcement learning with verifiable rewards (RLVR) does not overhaul the base model but fine-tunes it: 86.67% of high-entropy token positions still overlap with the base model's even after extensive training [24][25].
- Tokens with higher initial entropy show larger entropy increases during RLVR training, while low-entropy tokens remain largely unchanged [25].
- The article suggests that high-entropy tokens may explain why reinforcement learning generalizes better than supervised fine-tuning, which tends toward memorization and overfitting [26][27].
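As a rough illustration of the training recipe described in Group 3, the sketch below computes per-token entropy from the policy's logits and restricts a simple policy-gradient loss to the top 20% of positions by entropy. The shapes, vocabulary size, and scalar advantage are assumptions for illustration, not the team's actual training code.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    """Entropy of the next-token distribution at each position.
    logits: [seq_len, vocab_size] for one sampled response."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)              # [seq_len]

def high_entropy_mask(entropy, keep_ratio=0.2):
    """Keep only the top `keep_ratio` fraction of positions by entropy,
    mirroring the 'train on the top 20% of tokens' recipe."""
    k = max(1, int(keep_ratio * entropy.numel()))
    threshold = torch.topk(entropy, k).values.min()
    return (entropy >= threshold).float()

# Toy usage: a REINFORCE-style loss restricted to high-entropy positions.
logits = torch.randn(12, 50, requires_grad=True)         # 12 tokens, toy vocab of 50
sampled = torch.randint(0, 50, (12,))
logp = F.log_softmax(logits, dim=-1).gather(-1, sampled[:, None]).squeeze(-1)
mask = high_entropy_mask(token_entropy(logits.detach()))
advantage = 1.0                                          # stand-in for a real advantage
loss = -(advantage * logp * mask).sum() / mask.sum()
loss.backward()
```

The low-entropy positions still appear in the sampled response; they simply receive no gradient, which is the "discard 80% of tokens" idea in its simplest form.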
梦晨, reporting from 凹非寺 | 量子位 (公众号 QbitAI)

One of the hottest recent papers on arXiv is the latest result from the Qwen & Tsinghua LeapLab team: when training large-model reasoning with reinforcement learning, just the 20% of tokens with high entropy are enough to carry the entire training effect, and can even beat training on all tokens. Using this finding, the team set a new SOTA record on Qwen3-32B: 63.5 on AIME'24 and 56.7 on AIME'25, the highest scores for any model under 600B parameters trained directly from a base model. After extending the maximum response length from 20k to 29k, the AIME'24 score climbed further to 68.1.

Decoding the entropy distribution of chain-of-thought

To understand this work, start from an interesting observation. The team found that when a large model performs chain-of-thought reasoning, the entropy distribution of its tokens shows a distinctive pattern: most tokens have very low entropy, and only a small minority exhibit high entropy. Concretely, more than 50% of tokens have entropy below 0.01, while only 20% have entropy above 0.672. The classic 80/20 rule (Pareto principle) says that 80% of outcomes are usually driven by 20% of the key factors, but the remaining 80% still cannot simply be discarded. In large-model reinforcement learning, however, the 80 ...
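A small sketch of how the distribution figures quoted above could be checked, given a tensor of per-token entropies collected from chain-of-thought rollouts. The entropies here are synthetic and only serve to show the two statistics (fraction below 0.01 and the 80th-percentile threshold); the reported values of 50% and 0.672 come from the article, not from this code.

```python
import torch

# Synthetic stand-in for per-token entropies gathered from chain-of-thought rollouts;
# the heavy skew toward zero imitates the pattern the article describes.
entropies = torch.rand(10_000) ** 8

frac_near_zero = (entropies < 0.01).float().mean().item()   # article: over 50%
top20_threshold = torch.quantile(entropies, 0.80).item()    # article: about 0.672
print(f"fraction of tokens with entropy < 0.01: {frac_near_zero:.2f}")
print(f"entropy threshold for the top 20%: {top20_threshold:.3f}")
```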
10 Lines of Code, a 15% Gain on AIME24/25! Demystifying the Entropy Mechanism of Large-Model Reinforcement Learning
机器之心· 2025-06-05 07:14
Core Insights
- The article analyzes the entropy collapse problem in reinforcement learning for large language models (LLMs) and proposes solutions that preserve exploration during training [3][5][24].

Group 1: Entropy Collapse in Reinforcement Learning
- The core challenge in reinforcement learning is the trade-off between exploitation and exploration, with policy entropy serving as the key indicator of remaining exploration potential [4].
- A striking finding is that policy entropy drops to near zero within a few training steps, signaling a loss of exploration ability that leads to performance stagnation [4][5].
- The relationship between policy entropy and downstream performance is analyzed quantitatively, revealing that in the absence of entropy interventions, performance is entirely determined by policy entropy [4][5].

Group 2: Mechanisms Behind Entropy Changes
- The study identifies the driver of policy-entropy change during reinforcement learning: the covariance between action probabilities and their corresponding advantages [5][13].
- High-advantage, high-probability actions reduce policy entropy, while rare high-advantage actions increase it [13][17].

Group 3: Proposed Solutions for Enhancing Entropy
- The article introduces two simple yet effective entropy-preserving reinforcement learning strategies, Clip-Cov and KL-Cov, which can be implemented with minimal code changes (a sketch of the Clip-Cov idea follows this summary) [5][22].
- Experiments show these methods deliver significant gains: a 6.4% improvement on Qwen2.5-32B and up to 15% on challenging datasets such as AIME24/25 [22][24].
- The work emphasizes that maintaining exploration is essential for scalable reinforcement learning; simply adding compute yields limited benefit if the entropy bottleneck is not addressed [7][24].
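Below is a minimal sketch of the Clip-Cov idea as summarized above: since the tokens with the largest (log-probability, advantage) covariance contribution are the ones driving entropy down, their gradient contribution is dropped from the policy loss. The clip fraction, tensor shapes, and loss form are illustrative assumptions, not the released implementation.

```python
import torch

def clip_cov_mask(logprobs, advantages, clip_fraction=0.002):
    """Zero out the tokens with the largest per-token contribution to
    Cov(log pi, A); those are the tokens identified above as collapsing entropy.
    `clip_fraction` is an illustrative hyperparameter."""
    cov = (logprobs - logprobs.mean()) * (advantages - advantages.mean())
    k = max(1, int(clip_fraction * cov.numel()))
    top = torch.topk(cov, k).indices
    mask = torch.ones_like(logprobs)
    mask[top] = 0.0                     # exclude the highest-covariance tokens
    return mask

# Toy usage inside a policy-gradient step
logprobs = torch.randn(2048, requires_grad=True)   # per-token log-probs of sampled actions
advantages = torch.randn(2048)                     # per-token advantages
mask = clip_cov_mask(logprobs.detach(), advantages)
loss = -(mask * advantages * logprobs).mean()
loss.backward()
```

KL-Cov, as described in the article, takes the same high-covariance tokens but applies a KL penalty toward the reference policy on them instead of masking their gradient.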
Large-Model RL Isn't Just Math and Code! A 7B Reward Model Handles Medicine, Law, Economics, and Every Other Subject, Solving Problems Even Without Chain-of-Thought
量子位· 2025-04-02 07:40
Core Insights
- The article discusses a framework from Tencent and Soochow University that extends RLVR (reinforcement learning with verifiable rewards) training beyond mathematics and coding to disciplines including medicine, chemistry, law, psychology, and economics [3][4].

Group 1: Framework and Methodology
- The framework uses a model-based soft reward, which shows clear improvements in generalization, robustness, and scalability over traditional binary rule-based rewards (a sketch of such a soft reward follows this summary) [4].
- The approach builds on the observation that when a task has an objective reference answer, different large language models are highly consistent in their binary (correct/incorrect) judgments [7].
- The team distilled a 7B reward model from the 72B-parameter Qwen2.5-Instruct model without any domain-specific annotation, relying solely on data collected during the online exploration phase [9].

Group 2: Experimental Results
- The study sampled 6,000 questions from ExamQA, covering a wide range of subjects across science, engineering, and the humanities [12].
- The RM-7B model outperformed a range of baselines on free-form answer tasks, including base models, fine-tuned models, and rule-based reinforcement learning [14].
- RM-7B achieved an average score of 62.5 on multi-subject tasks, outperforming other methods in both the binary-reward and soft-reward settings [15].

Group 3: Scalability and Future Research
- Model-based rewards scale better as data volume grows, suggesting a more effective approach for handling unstructured reference answers [18].
- The authors note that while chain-of-thought reasoning (CoT) helps in many scenarios, whether it is necessary for judging semantic equivalence between a reference answer and a model response remains an open question [16].
- The study imposes no format constraints on reference answers or model responses, which reduces the labor of data standardization, but the role of format-related constraints and rewards needs further examination [17].
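The sketch below shows one way a model-based soft reward of the kind described above could be implemented with a HuggingFace-style causal language model acting as the judge: the reward is the probability the judge assigns to a "Yes" verdict rather than a hard 0/1 outcome. The prompt template, token choices, and function names are assumptions for illustration, not the paper's exact setup.

```python
import torch

def soft_reward(judge_model, tokenizer, question, reference, response):
    """Model-based soft reward: score a free-form answer against the reference
    by the judge's probability of answering 'Yes', instead of a binary rule."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {response}\n"
        f"Is the model answer correct? Answer Yes or No:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = judge_model(**inputs).logits[0, -1]      # next-token logits
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()                                # soft reward in [0, 1]
```

A binary rule-based reward would threshold this judgment to 0 or 1; the article's point is that keeping the continuous score generalizes better across subjects with unstructured reference answers.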