Large Model Reinforcement Learning
After "In-Context Learning," Tencent Hunyuan's Second Public Study: Precisely Locating the "Culprit" Tokens Behind RLVR Training Collapse
机器之心· 2026-02-14 04:54
Core Insights
- The article introduces Gradient Anomaly Localizer (GradLoc), a tool designed to improve the observability of Reinforcement Learning with Verifiable Rewards (RLVR) training and to lower the engineering barriers to studying RLVR's underlying physical and statistical mechanisms [2][6][12]
- The focus of large-model competition is shifting from pre-training in 2024 to post-training in 2025, and RLVR still faces high engineering hurdles despite algorithmic advances [5][6]
- GradLoc pinpoints gradient spikes at the token level, turning debugging from a black-box exercise into a scientific, data-driven process [10][12][31]

Engineering Challenges
- RLVR training is noisy and complex; the interdependence of data distribution and model parameters makes training dynamics hard to analyze [5][6]
- Traditional debugging relies heavily on expert intuition and global monitoring metrics, leading to long verification cycles and high time costs [8][12]

GradLoc Implementation
- GradLoc uses a binary-search strategy to locate the specific tokens causing gradient anomalies, reducing the search from linear to logarithmic complexity [14][16]
- The tool dynamically adjusts detection thresholds to minimize false positives and false negatives without incurring excessive computational cost [16][18]

Systematic Iteration and Improvement
- With GradLoc, developers can establish a systematic iteration loop of real-time localization, anomaly attribution, and targeted fixes, deepening their understanding of how various algorithmic improvements actually work [19][31]
- LayerClip, a method for handling layer-wise gradient heterogeneity, further improves training stability by setting an independent clipping threshold for each layer [29][31]

Future Outlook
- The article emphasizes that lowering the observational barriers around underlying mechanisms will enable deeper exploration at the intersection of theory and application in large-model training [36][37]
- Continued development and open-sourcing of tools like GradLoc aim to make gradient-anomaly localization as routine as monitoring loss curves, fostering a more robust research environment [35][36]
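The binary-search localization idea above can be sketched in a few lines. Since the summary does not detail GradLoc's actual interface, the names here (`span_grad_norm`, `locate_spike`) and the assumption that per-token gradient contributions can be summed over spans are illustrative, not the tool's real API:

```python
# Illustrative sketch of binary-search gradient-spike localization
# (function names and the per-token gradient interface are hypothetical).

def span_grad_norm(grads, lo, hi):
    """Gradient norm contributed by tokens in [lo, hi)."""
    return sum(abs(g) for g in grads[lo:hi])

def locate_spike(grads, threshold, lo=0, hi=None):
    """Binary-search for a token whose gradient contribution exceeds
    `threshold`, using O(log n) span evaluations rather than checking
    all n tokens one by one."""
    if hi is None:
        hi = len(grads)
    if hi - lo == 1:
        return lo if abs(grads[lo]) > threshold else None
    mid = (lo + hi) // 2
    # Descend into whichever half still carries an anomalous norm.
    if span_grad_norm(grads, lo, mid) > threshold:
        return locate_spike(grads, threshold, lo, mid)
    if span_grad_norm(grads, mid, hi) > threshold:
        return locate_spike(grads, threshold, mid, hi)
    return None

grads = [0.01] * 1000
grads[417] = 50.0  # a single anomalous token
assert locate_spike(grads, threshold=10.0) == 417
```

With 1,000 tokens, the culprit is found in about 10 halvings instead of 1,000 per-token checks, which is the linear-to-logarithmic reduction the article describes.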
Revealed! Long-Overlooked Critical Flaws in RLVR/GRPO
机器之心· 2026-01-30 08:49
Core Insights
- The article discusses advances in large models, focusing on Reinforcement Learning with Verifiable Rewards (RLVR), which lets models improve through self-exploration rather than relying on external scoring [2]
- A critical issue is identified: systematic bias in group-relative advantage estimation, in which difficult problems are consistently underestimated while simple problems are overestimated [3][5]
- The findings show that this bias is not merely sampling noise but is inherent to the statistical structure of group-relative advantage estimation [6][23]

Group-Relative Advantage Estimation
- Group-relative advantage estimation samples multiple responses for a given prompt and uses the average reward as a baseline for evaluating each individual response [9][10]
- The expected reward and advantage are defined mathematically, with the expected advantage being the difference between a response's reward and the expected reward [12][14]
- Problem difficulty is categorized by expected reward: values below 0.5 are considered difficult and values above 0.5 simple [16]

Systematic Bias in Estimation
- The article presents a theorem showing that group-relative advantage estimation is systematically biased by prompt difficulty, underestimating advantages on difficult prompts and overestimating them on simple ones [23][30]
- Visual analysis shows the estimation bias grows as prompt difficulty deviates from 0.5, with smaller sample sizes exacerbating it [24][25]
- Specific examples illustrate how the bias can significantly misrepresent true advantages, particularly in challenging scenarios [26][28]

Implications for RLVR Training
- The systematic bias can produce imbalanced gradient signals during training, hindering effective exploration and favoring simple samples over challenging ones [40]
- The article proposes an adaptive adjustment mechanism: estimated advantages on difficult prompts should be amplified to encourage exploration, while those on simple prompts should be suppressed [40][42]
- The HA-DW algorithm dynamically assesses prompt difficulty and adjusts advantage estimation accordingly, improving performance on difficult prompts [41][42]
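A minimal illustration of where such a bias can come from, under simplifying assumptions not spelled out in the summary (binary 0/1 rewards, a group baseline that includes the sampled response itself, p = expected reward, G = group size); the paper's theorem is presumably more general:

```python
# Toy calculation of the expected group-relative advantage for a
# Bernoulli(p) reward, conditioned on whether the response is correct.
# Assumptions (binary rewards, self-inclusive group mean) are illustrative.

def expected_estimated_advantage(p, G, correct):
    """E[r_i - mean(group)] given r_i, for a group of G Bernoulli(p) rewards."""
    r = 1.0 if correct else 0.0
    baseline = (r + (G - 1) * p) / G   # group mean includes r_i itself
    return r - baseline

# A correct answer's true advantage is 1 - p, but the estimate shrinks it
# by (G-1)/G, so the absolute bias (1-p)/G is largest on hard prompts.
assert abs(expected_estimated_advantage(0.1, 4, True) - 0.675) < 1e-9   # true: 0.9
assert abs(expected_estimated_advantage(0.9, 4, False) + 0.675) < 1e-9  # true: -0.9
```

The direction matches the article: rare correct answers on difficult prompts (small p) have their positive advantage shrunk the most, and the effect grows as G shrinks, which is what an HA-DW-style difficulty-aware rescaling would compensate for.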
Training-Free GRPO, Watched by 630,000 People on X: Bringing GRPO into Context-Space Learning
机器之心· 2025-10-22 08:46
Core Viewpoint
- The article introduces Training-Free Group Relative Policy Optimization (GRPO), a method that performs reinforcement learning (RL) without updating model parameters, making it more accessible and cost-effective for developers and smaller teams [4][20][28]

Summary by Sections

GRPO Overview
- GRPO has gained popularity in large-model reinforcement learning, particularly for tasks like mathematical reasoning and multi-agent collaboration [2]
- GRPO's core mechanism of "multi-path parallelism + group advantage" is powerful but costly when used to optimize model parameters [3]

Training-Free GRPO
- Tencent Youtu's recent paper addresses the high cost of parameter updates by moving GRPO's learning process into the context space, generating and evaluating multiple answer paths without changing model parameters [4][6]
- The method generates multiple rollout paths for the same problem, scores them, and uses the advantage signals to refine the model's preference for high-quality solutions [4][10]

Experimental Results
- In mathematical reasoning tasks, Training-Free GRPO improves performance using only 100 training samples, at a cost of roughly $8 to $18 on a 671-billion-parameter model [13][24]
- The method yields significant gains, such as a 4.6% increase in Pass@1 in web-search scenarios, without any parameter updates [17][18]

Advantages of Training-Free GRPO
- The approach retains GRPO's advantages, including multi-path exploration and separate training and test sets, while drastically cutting costs by eliminating parameter updates [20][21]
- It generalizes better across tasks without the complexity and maintenance costs of multiple specialized models [25]

Conclusion
- Training-Free GRPO reframes reinforcement learning by demonstrating that effective RL can be achieved without traditional parameter updates, making it a viable option for developers with limited resources [26][28]
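The rollout-score-distill loop described above can be sketched as one step; since the paper's implementation is not shown here, `model` and `reward_fn` are toy stand-ins, and the "experience library" format is a guess at the general idea:

```python
# Sketch of one Training-Free GRPO step: the base model stays frozen, and
# the "update" distills the highest-advantage rollout into a natural-
# language experience library carried in the context. All stubs are toys.

def training_free_grpo_step(prompt, model, reward_fn, experiences, k=4):
    rollouts = [model(prompt, experiences) for _ in range(k)]  # k parallel paths
    rewards = [reward_fn(r) for r in rollouts]
    baseline = sum(rewards) / k                                # group mean, as in GRPO
    advantages = [r - baseline for r in rewards]
    best = max(range(k), key=advantages.__getitem__)
    if advantages[best] > 0:                                   # keep only above-average paths
        experiences.append(f"On tasks like {prompt!r}, a good approach: {rollouts[best]}")
    return experiences

# Toy stubs: a frozen "model" proposing answers, and a verifiable reward.
answers = iter(["41", "42", "40", "41"])
model = lambda prompt, exp: next(answers)
reward_fn = lambda ans: 1.0 if ans == "42" else 0.0

exp = training_free_grpo_step("6*7?", model, reward_fn, [])
assert len(exp) == 1 and "42" in exp[0]
```

The key design point, per the article, is that the group-advantage signal drives a textual update to the context rather than a gradient update to the weights.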
Xiaomi's Latest Large-Model Result! Luo Fuli Makes an Appearance
自动驾驶之心· 2025-10-18 16:03
Core Insights
- Xiaomi's AI team, in collaboration with Peking University, has published a paper on MoE (Mixture of Experts) and reinforcement learning, revealing new advances in large-model training [2][8]

Group 1: Research Findings
- The paper proposes a novel approach to improving the stability and efficiency of large-model reinforcement learning within the MoE framework [8][10]
- Current reinforcement learning methods struggle to balance efficiency and stability, often leading to catastrophic failures during training [14][24]
- The research introduces Rollout Routing Replay (R3), which locks the routing distribution during inference and reuses it during training, ensuring consistency between the two phases [30][31]

Group 2: Experimental Results
- Experiments on the Qwen3-30B-A3B model show that R3 consistently outperforms other methods, achieving higher scores across multiple scenarios [41][42]
- R3 significantly reduces the occurrence of training crashes, maintaining a stable performance curve even over extended training runs [44][48]
- R3 not only stabilizes the model but also accelerates optimization, allowing effective strategies to be identified more quickly [50]

Group 3: Team and Contributors
- The research team includes notable contributors such as Wenhan Ma, a researcher on Xiaomi's LLM-Core team, and Luo Fuli, who has a strong academic background and has previously worked on major AI projects [52][59]
- The paper also acknowledges the contributions of Professor Sui Zhifang of Peking University, who has extensive experience in computational linguistics and AI research [62][66]
Xiaomi's Latest Large-Model Result! Luo Fuli Makes an Appearance
量子位· 2025-10-17 04:58
Core Viewpoint
- Xiaomi's latest AI research paper, co-authored with Peking University, focuses on improving the stability and efficiency of large-model reinforcement learning using a new method called Rollout Routing Replay (R3) [2][7][49]

Group 1: Research Background
- The collaboration between Xiaomi's AI team and Peking University has produced significant advances in AI, particularly in reinforcement learning [2][4]
- The paper addresses challenges in the Mixture of Experts (MoE) architecture, whose routing mechanisms can destabilize training [8][25]

Group 2: Methodology
- R3 stabilizes training by locking the routing distribution during inference and replaying it during training, ensuring consistency between the two phases [28][30]
- The research additionally introduces a routing mask that caches routing decisions alongside the context, improving computational efficiency [34][35]

Group 3: Experimental Results
- Experiments on the Qwen3-30B-A3B model show that R3 consistently outperforms other methods across metrics, indicating improved overall performance [40][41]
- Training stability improves significantly, with R3 maintaining a smoother performance curve than traditional methods [43][46]

Group 4: Authors and Contributions
- The first author, Wenhan Ma, is a researcher on Xiaomi's LLM-Core team; the two corresponding authors are Luo Fuli and Professor Sui Zhifang of Peking University, both of whom have made notable contributions to the field [51][56][61]
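The lock-and-replay idea behind R3 can be illustrated with a toy top-k gate; real MoE routing operates on learned gate logits inside each layer, so the shapes and caching scheme below are simplified assumptions, not the paper's implementation:

```python
# Conceptual sketch of Rollout Routing Replay (R3) with a toy top-k gate.

def topk_route(logits, k=2):
    """Indices of the k highest-scoring experts for one token."""
    return sorted(range(len(logits)), key=lambda e: -logits[e])[:k]

def rollout(gate_logits_per_token, cache):
    """Inference pass: route normally and record each token's decision."""
    for t, logits in enumerate(gate_logits_per_token):
        cache[t] = topk_route(logits)
    return cache

def training_forward(gate_logits_per_token, cache):
    """Training pass: replay the cached routing instead of re-deciding,
    so training activates exactly the experts used at inference."""
    return [cache[t] for t in range(len(gate_logits_per_token))]

# Slightly perturbed logits at training time (e.g. numerical-precision
# differences) would flip token 0's routing; the replayed cache prevents it.
infer_logits = [[0.51, 0.49, 0.1], [0.2, 0.9, 0.3]]
train_logits = [[0.49, 0.51, 0.1], [0.2, 0.9, 0.3]]
cache = rollout(infer_logits, {})
assert training_forward(train_logits, cache) == [[0, 1], [1, 2]]
```

The cached decision per token is exactly the kind of per-context routing record the article's "routing mask" caches for efficiency.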
Chen Danqi's New Work: A Third Path for Large-Model RL, an 8B Model Surpasses GPT-4o
量子位· 2025-09-28 04:56
Core Viewpoint
- The article presents RLMT (Reinforcement Learning with Model-rewarded Thinking), which combines the strengths of RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards), enabling an 8-billion-parameter model to outperform GPT-4o and rival Claude-3.7-Sonnet [1][4][11]

Group 1: Methodology and Performance
- RLMT requires the model to generate a Chain of Thought (CoT) before producing an answer, which is then evaluated by a reward model trained on human preferences [5][17]
- The method can be applied directly to base models without supervised fine-tuning (SFT), significantly reducing post-training costs [6][22]
- In benchmark tests, the L3.1-8B-RLMT model achieved an average score of 84.3, surpassing larger models such as GPT-4o and Claude-3.7-Sonnet [7]

Group 2: Training Process
- Training generates a reasoning trajectory from the user prompt, then scores the final answer with the reward model [14]
- Two training setups are highlighted: Warm-start (using SFT data) and Zero (direct training without SFT), both of which improve performance [21][19]
- RLMT shapes the model's reasoning style to resemble human thought processes, yielding higher-quality dialogue and writing [19]

Group 3: Implications and Future Directions
- RLMT sets a new baseline for general-purpose reinforcement learning and underscores the importance of defining preferences in the post-training era [8]
- The results show that smaller models can outperform much larger ones, suggesting a shift in focus toward efficiency in model training [22]
- The research team, led by Chen Danqi, aims to further explore natural language understanding and reasoning capabilities in future studies [24][25]
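The RLMT sampling step described above (think first, answer second, then score the answer with a preference-trained reward model) can be sketched with stubs; both callables are illustrative stand-ins, not the paper's implementation:

```python
# Hedged sketch of one RLMT sample: CoT first, answer conditioned on it,
# reward from a (stubbed) human-preference reward model.

def rlmt_sample(policy, reward_model, prompt):
    cot = policy(prompt, mode="think")            # reasoning trace first
    answer = policy(prompt + cot, mode="answer")  # then the final answer
    return cot, answer, reward_model(prompt, answer)

# Toy stand-ins for an 8B policy and a preference reward model.
policy = lambda text, mode: " Let me compute 6*7." if mode == "think" else "42"
reward_model = lambda prompt, answer: 1.0 if answer == "42" else 0.0

cot, answer, reward = rlmt_sample(policy, reward_model, "What is 6*7?")
assert answer == "42" and reward == 1.0
```

In full training, the returned reward would drive a standard policy-gradient update over the trajectory; the sketch only covers the sampling-and-scoring step that distinguishes RLMT from plain RLHF.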
Large-Model Reinforcement Learning: Next to PPO, Is DPO Still the Junior?
自动驾驶之心· 2025-06-22 14:09
Core Insights
- The article examines the theoretical and experimental shortcomings of DPO (Direct Preference Optimization) relative to PPO (Proximal Policy Optimization), noting that while DPO appears to lead on open-source benchmarks, top closed-source models such as GPT-4 and Claude use PPO [1][2]

DPO's Deficiencies
- DPO suffers from issues akin to reward hacking: despite lacking an explicit reward model, it can produce solutions that do not align with human preferences [2]
- Theoretically, given true reward signals, the policies reachable by PPO form a proper subset of those reachable by DPO, meaning DPO can converge to solutions that deviate from the reference policy [3]

Experimental Findings
- Experiments show that DPO can assign higher probability to data points not covered by the preference dataset, producing unexpected behaviors, whereas PPO optimizes effectively under its KL constraint [6]
- DPO's performance improves when distribution drift is reduced through methods such as SafeSFT, but it still does not surpass PPO [8]

Performance Metrics
- Benchmarks consistently show PPO outperforming both iterative DPO and plain DPO across tasks, particularly in programming competitions [10]
- PPO-trained models reach up to 44.4% on pass@5 metrics, while DPO-trained models struggle to achieve meaningful results [11][12]

Conclusion
- While DPO has theoretical merits, its practical value on high-stakes tasks like programming remains limited compared with PPO, which continues to set new performance standards [13]
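For reference, the standard DPO objective the comparison hinges on fits in a few lines. This is a self-contained sketch in which log-probabilities are scalars already summed over a response:

```python
# The standard DPO loss for one preference pair:
# -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))),
# i.e. push up the policy's log-ratio on the chosen response relative to
# the rejected one, anchored to a frozen reference model.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))

# When policy and reference agree, the margin is 0 and the loss is log 2.
assert abs(dpo_loss(-1.0, -2.0, -1.0, -2.0) - math.log(2)) < 1e-9
```

Note what the loss does not do: nothing constrains probability mass on responses outside the preference dataset, which is exactly the out-of-distribution failure mode the experiments above attribute to DPO, and which PPO's explicit KL-regularized rollouts avoid.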
A New Breakthrough in Large-Model Reinforcement Learning: The SPO Paradigm Boosts LLM Reasoning!
机器之心· 2025-06-08 08:21
Core Viewpoint
- The article discusses the potential of reinforcement learning (RL) to enhance the reasoning capabilities of large language models (LLMs), highlighting the effectiveness of models such as DeepSeek R1, Kimi K1.5, and Qwen 3 on complex reasoning tasks [1]

Current Challenges
- A fundamental challenge in effective RL is the credit assignment problem: attributing the final evaluation of an LLM's response to the specific decisions (tokens) within the sequence [2]
- The difficulty stems from sparse reward signals, which provide clear success or failure feedback only at the end of the sequence [3]

Current Methods
- Advantage estimation is the standard tool for credit assignment in RL; current methods for LLMs fall into two categories by granularity [5]
- Coarse-grained trajectory-level methods, like the GRPO used in DeepSeek R1, compute a single advantage from the final reward, so they cannot reward the correct parts of a wrong answer or penalize redundant parts of a correct one [6]
- Fine-grained token-level methods, such as PPO, estimate an advantage for every token but suffer high estimation error, because trajectory distributions differ greatly across prompts and sampling during training is limited [6]

New SPO Framework
- A research team from the Chinese Academy of Sciences and City University of Hong Kong proposes the Segment Policy Optimization (SPO) framework to overcome these limitations [8]
- SPO uses medium-grained, segment-level advantage estimation, dividing generated sequences into contiguous segments and computing an advantage for each [11]

Advantages of SPO
- Improved credit assignment: segment-level feedback is localized, so the model can reward valuable parts of incorrect answers and penalize redundant segments in correct ones [12]
- More accurate advantage estimation: fewer estimation points make unbiased Monte Carlo estimation practical, without relying on an unstable critic model [12]
- Flexibility and adaptability: segment boundaries can be defined arbitrarily, interpolating between token-level and trajectory-level granularity to suit different tasks [12]

Core Components of SPO
- SPO consists of three core components: a flexible segment-division strategy, Monte Carlo-based segment-level advantage estimation, and policy optimization using segment-level advantages [13]

Specific Instances of SPO
- The team proposes two instances of the framework: SPO-chain for short chain-of-thought scenarios and SPO-tree for long chain-of-thought scenarios, the latter improving Monte Carlo sampling efficiency [15]

Token Probability-Mask Strategy
- A token probability-mask strategy selectively computes the loss only on low-probability tokens within each segment, which are the critical decision points for segment-level advantages [16]

Experimental Results
- In short chain-of-thought scenarios, models trained with SPO achieved higher accuracy than various baseline training algorithms [29]
- In long chain-of-thought scenarios, SPO-tree outperformed GRPO in accuracy with the same base model and training time [31]
- Among segment-division methods, cutpoint-based division performed best in short chain-of-thought scenarios [36]

Conclusion
- The work presents SPO, an RL training framework based on medium-grained segment-level advantages that balances token-level and trajectory-level methods, offering better credit assignment while requiring fewer estimation points [42]
- Experiments validate the effectiveness of the SPO framework and its two instances, SPO-chain and SPO-tree [43]
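The cutpoint-based segmentation and segment-level advantages can be illustrated with a toy: split a trajectory before low-probability tokens, then take each segment's advantage as the difference of value estimates at its boundaries. `mc_value` stands in for Monte Carlo rollouts from intermediate states, and the 0.5 cutoff is an arbitrary illustration; the paper's estimator is more involved:

```python
# Illustrative sketch of SPO-style segment division and advantage
# estimation (cutoff value and value function are toy assumptions).

def segment_boundaries(token_probs, cutoff=0.5):
    """Cut before each low-probability (high-uncertainty) token."""
    cuts = [i for i, p in enumerate(token_probs) if p < cutoff and i > 0]
    return [0] + cuts + [len(token_probs)]

def segment_advantages(bounds, mc_value):
    """Advantage of each segment = V(state after it) - V(state before it)."""
    return [mc_value(bounds[i + 1]) - mc_value(bounds[i])
            for i in range(len(bounds) - 1)]

probs = [0.9, 0.95, 0.3, 0.8, 0.2, 0.99]   # tokens 2 and 4 are cutpoints
bounds = segment_boundaries(probs)
assert bounds == [0, 2, 4, 6]

# Toy value table: success probability estimated by Monte Carlo rollouts
# from the prefix ending at each boundary position.
values = {0: 0.4, 2: 0.7, 4: 0.6, 6: 1.0}
advs = segment_advantages(bounds, values.get)
assert [round(a, 1) for a in advs] == [0.3, -0.1, 0.4]
```

The middle segment gets a negative advantage even though the trajectory may end correctly, which is the localized credit assignment that trajectory-level GRPO cannot provide; the token probability-mask would then apply these advantages only to the low-probability tokens inside each segment.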
Qwen & Tsinghua Team Overturn Conventional Wisdom: Large-Model RL Using Only 20% of Key Tokens Beats Training on All Tokens
量子位· 2025-06-05 10:28
Core Insights
- The article discusses a recent breakthrough by the LeapLab team from Tsinghua University: training on only the 20% of tokens with the highest entropy significantly improves reinforcement learning for large models, outperforming training on all tokens [1][6]

Group 1: Research Findings
- The team achieved new state-of-the-art (SOTA) records with the Qwen3-32B model, scoring 63.5 on AIME'24 and 56.7 on AIME'25, the highest for models under 600 billion parameters trained directly from a base model [2]
- Extending the maximum response length from 20k to 29k raised the AIME'24 score to 68.1 [4]
- The findings challenge the classic Pareto principle: in large-model reinforcement learning, the 80% of low-entropy tokens can be discarded without harm, and keeping them may even hurt [5][6]

Group 2: Token Analysis
- Chain-of-thought reasoning shows a distinctive entropy distribution: over 50% of tokens have entropy below 0.01, while only 20% exceed 0.672 [9][10]
- High-entropy tokens act as "logical connectors" in reasoning, while low-entropy tokens are typically deterministic components such as affixes or parts of mathematical expressions [11]
- Experiments show that raising the sampling temperature of high-entropy tokens improves reasoning performance while lowering it hurts, underscoring the importance of maintaining high entropy at critical positions [13]

Group 3: Training Methodology
- Restricting reinforcement learning training to the top 20% of high-entropy tokens substantially improved Qwen3-32B: AIME'24 scores rose by 7.71 points and AIME'25 by 11.04 points, with average response length growing by roughly 1,378 tokens [15][17]
- Similar gains were observed on Qwen3-14B, while Qwen3-8B held steady [16]
- Conversely, training on the 80% of low-entropy tokens sharply degraded performance, indicating their minimal contribution to reasoning capabilities [18]

Group 4: Implications and Generalization
- High-entropy tokens appear to enable exploration of different reasoning paths, while deterministic low-entropy tokens may restrict it [20]
- The advantage of high-entropy-token training grows with model size, with the 32B model showing the largest improvements [22]
- Models trained on high-entropy tokens also performed exceptionally well on out-of-domain tasks, suggesting a link between high-entropy tokens and generalization [22]

Group 5: Reinforcement Learning Insights
- Reinforcement learning with verifiable rewards (RLVR) does not overhaul the base model but fine-tunes it: high-entropy token positions overlap 86.67% with the base model's even after extensive training [24][25]
- Tokens with higher initial entropy show larger entropy increases during RLVR training, while low-entropy tokens remain largely unchanged [25]
- The discussion suggests high-entropy tokens may explain why reinforcement learning generalizes better than supervised fine-tuning, which tends toward memorization and overfitting [26][27]
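The top-20% selection described above amounts to computing per-token entropy and masking the policy-gradient loss to the highest-entropy positions; in practice the entropies come from the model's per-token output distributions, so the toy distributions below are purely illustrative:

```python
# Sketch of restricting the RL loss to the top-20% highest-entropy tokens.
import math

def token_entropy(probs):
    """Shannon entropy of one token's output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_entropy_mask(entropies, keep_ratio=0.2):
    """1 for tokens in the top `keep_ratio` by entropy, else 0."""
    k = max(1, int(len(entropies) * keep_ratio))
    cutoff = sorted(entropies, reverse=True)[k - 1]
    return [1 if h >= cutoff else 0 for h in entropies]

# Ten tokens: eight near-deterministic, two genuinely uncertain "forks".
dists = [[0.999, 0.001]] * 8 + [[0.5, 0.5], [0.6, 0.4]]
ents = [token_entropy(d) for d in dists]
mask = high_entropy_mask(ents)
assert mask == [0] * 8 + [1, 1]
```

The masked training objective then multiplies each token's advantage-weighted log-probability term by this 0/1 mask, so gradients flow only through the "logical connector" positions.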
Mengchen, reporting from Aofeisi. 量子位 | Official account QbitAI

The hottest recent arXiv paper, the latest result from the Qwen & Tsinghua LeapLab team: when using reinforcement learning to train large models' reasoning ability, just 20% of high-entropy tokens can carry the entire training effect, and even works better than training on all tokens.

With this finding the team set a new SOTA record on Qwen3-32B: 63.5 on AIME'24 and 56.7 on AIME'25, the highest scores for models under 600B parameters trained directly from a base model. Extending the maximum response length from 20k to 29k pushed the AIME'24 score up to 68.1.

Decoding the Entropy Distribution of Chain-of-Thought

To understand this research, start from an interesting observation: the team found that when a large model performs Chain-of-Thought reasoning, the entropy distribution of its tokens follows a distinctive pattern: most tokens have very low entropy, and only a minority exhibit high entropy. Specifically, more than 50% of tokens have entropy below 0.01, while only 20% have entropy above 0.672.

The classic 80/20 rule (the Pareto principle) says that 80% of outcomes are usually driven by 20% of key factors, but the remaining 80% still cannot simply be discarded. In large-model reinforcement learning, however, the 80 ...