Large Model Reinforcement Learning
Training-Free GRPO Draws 630K Views on X: Moving GRPO Learning into the Context Space
机器之心· 2025-10-22 08:46
Core Viewpoint
- The article discusses the introduction of Training-Free Group Relative Policy Optimization (GRPO), a method that allows for reinforcement learning (RL) without the need to update model parameters, making it more accessible and cost-effective for developers and smaller teams [4][20][28].

Summary by Sections

GRPO Overview
- GRPO has gained popularity in large model reinforcement learning, particularly for tasks like mathematical reasoning and multi-agent collaboration [2].
- The core mechanism of GRPO involves "multi-path parallelism + group advantage," which, while powerful, is costly in terms of model parameter optimization [3].

Training-Free GRPO
- Tencent Youtu's recent paper proposes a solution to the high costs of parameter updates by moving the GRPO learning process into the context space, allowing multiple answer paths to be generated and evaluated without changing model parameters [4][6].
- The method generates multiple rollout paths for the same problem, scores them, and uses the advantage signals to refine the model's preference for high-quality solutions (a minimal sketch of this loop follows the summary) [4][10].

Experimental Results
- In mathematical reasoning tasks, Training-Free GRPO can enhance performance using only 100 training samples, at a cost of approximately $8 to $18 on a 671-billion-parameter model [13][24].
- The method shows significant improvements in performance metrics, such as a 4.6% increase in Pass@1 in web search scenarios, without updating model parameters [17][18].

Advantages of Training-Free GRPO
- The approach retains the advantages of GRPO, including multi-path exploration and independent training/testing sets, while drastically reducing costs by eliminating the need for parameter updates [20][21].
- It allows for better generalization across different tasks without the complexity and maintenance costs associated with multiple specialized models [25].

Conclusion
- Training-Free GRPO represents a shift in the understanding of reinforcement learning, demonstrating that effective RL can be achieved without traditional parameter updates, making it a viable option for developers with limited resources [26][28].
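To make the idea concrete, here is a minimal sketch of one context-space GRPO step, assuming a frozen LLM reachable through an `llm_generate` callable and a scalar `reward_fn`; the prompt templates, group size, and experience-library format are illustrative assumptions, not Tencent Youtu's actual implementation.

```python
"""Minimal sketch of a Training-Free GRPO step carried out in context space."""
from typing import Callable, List

def training_free_grpo_step(
    question: str,
    experiences: List[str],                      # textual "parameter-free" memory
    llm_generate: Callable[[str], str],          # frozen LLM, e.g. an API call (assumed)
    reward_fn: Callable[[str, str], float],      # scores one rollout (assumed)
    group_size: int = 4,
) -> List[str]:
    context = "\n".join(f"- {e}" for e in experiences)
    prompt = f"Known useful experiences:\n{context}\n\nQuestion: {question}\nAnswer:"

    # 1. Multi-path rollout: sample a group of answers from the *frozen* model.
    rollouts = [llm_generate(prompt) for _ in range(group_size)]

    # 2. Group-relative advantage: score each rollout against the group mean.
    rewards = [reward_fn(question, r) for r in rollouts]
    mean_r = sum(rewards) / len(rewards)
    advantages = [r - mean_r for r in rewards]

    # 3. Instead of a gradient step, ask the model to distill what distinguishes
    #    an above-average rollout from a below-average one into a reusable lesson.
    best = rollouts[advantages.index(max(advantages))]
    worst = rollouts[advantages.index(min(advantages))]
    reflection_prompt = (
        "Compare a higher-scoring and a lower-scoring solution to the same question.\n"
        f"Question: {question}\nBetter: {best}\nWorse: {worst}\n"
        "State, in one sentence, a reusable lesson for solving similar problems:"
    )
    new_experience = llm_generate(reflection_prompt).strip()

    # 4. The "policy update" is just appending to the experience library.
    return experiences + [new_experience]
```

The update lives entirely in the prompt: the appended natural-language experience plays the role of the parameter change, which is why no gradients or GPU fine-tuning are required.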
Xiaomi's Latest Large-Model Result! Luo Fuli Makes an Appearance
自动驾驶之心· 2025-10-18 16:03
Core Insights
- Xiaomi's AI team, in collaboration with Peking University, has recently published a paper focusing on MoE (Mixture of Experts) and reinforcement learning, revealing new advancements in large model training [2][8].

Group 1: Research Findings
- The paper proposes a novel approach to enhance the stability and efficiency of large model reinforcement learning within the MoE framework [8][10].
- Current reinforcement learning methods face challenges in balancing efficiency and stability, often leading to catastrophic failures during training [14][24].
- The research introduces a method called Rollout Routing Replay (R3), which locks the routing distribution during inference and reuses it during training, ensuring consistency between the two phases (see the sketch after this summary) [30][31].

Group 2: Experimental Results
- Experiments conducted on the Qwen3-30B-A3B model demonstrate that R3 consistently outperforms other methods across various metrics, achieving higher scores in multiple scenarios [41][42].
- The introduction of R3 significantly reduces the occurrence of training crashes, maintaining a stable performance curve even after extended training periods [44][48].
- R3 not only stabilizes the model but also accelerates the optimization process, allowing for quicker identification of effective strategies [50].

Group 3: Team and Contributors
- The research team includes notable contributors such as Wenhan Ma, a researcher from Xiaomi's LLM-Core team, and Luo Fuli, who has a strong academic background and has previously worked on significant AI projects [52][59].
- The paper also acknowledges the contributions of Professor Sui Zhifang from Peking University, who has extensive experience in computational linguistics and AI research [62][66].
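A rough picture of the replay mechanism, assuming a simple top-k softmax router; the gating formula, tensor shapes, and module layout are illustrative, not Xiaomi's exact implementation.

```python
"""Sketch of Rollout Routing Replay (R3): cache the router's top-k expert choice
at rollout time and reuse it in the training forward pass."""
from typing import Optional, Tuple
import torch
import torch.nn.functional as F

def route(hidden: torch.Tensor, router_w: torch.Tensor, k: int = 2,
          replay_idx: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor]:
    """hidden: [tokens, d_model]; router_w: [d_model, n_experts].
    Returns (expert indices, gate weights). If replay_idx is given, the expert
    choice recorded during rollout is reused; only the gate weights are
    recomputed from the current activations."""
    logits = hidden @ router_w                        # [tokens, n_experts]
    if replay_idx is None:
        # Rollout / inference path: pick experts now and cache the indices.
        topk_idx = logits.topk(k, dim=-1).indices     # [tokens, k]
    else:
        # Training path: replay the cached routing decision, so training and
        # rollout activate the same experts even if logits differ slightly.
        topk_idx = replay_idx
    gate = F.softmax(logits.gather(-1, topk_idx), dim=-1)
    return topk_idx, gate

# Rollout: record routing for each generated token.
hidden = torch.randn(5, 16)
router_w = torch.randn(16, 8)
cached_idx, _ = route(hidden, router_w)

# Training: the same tokens are re-encoded with slightly different numerics,
# but the experts are replayed from the cache instead of being re-chosen.
_, gate = route(hidden + 0.01 * torch.randn_like(hidden), router_w, replay_idx=cached_idx)
```

The point is that expert indices chosen during rollout are treated as fixed during the training forward pass, so small numerical discrepancies between the inference engine and the training engine can no longer flip routing decisions.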
Xiaomi's Latest Large-Model Result! Luo Fuli Makes an Appearance
量子位· 2025-10-17 04:58
Core Viewpoint
- Xiaomi's latest AI research paper, co-authored with Peking University, focuses on improving stability and efficiency in large model reinforcement learning using a new method called Rollout Routing Replay (R3) [2][7][49].

Group 1: Research Background
- The collaboration between Xiaomi's AI team and Peking University has led to significant advancements in AI, particularly in the area of reinforcement learning [2][4].
- The paper addresses challenges in the Mixture of Experts (MoE) architecture, whose routing mechanism can lead to instability during training [8][25].

Group 2: Methodology
- The proposed R3 method stabilizes the training process by locking the routing distribution during inference and replaying it during training, ensuring consistency between the two phases [28][30].
- Additionally, the research introduces a routing mask to cache routing decisions alongside the context, enhancing computational efficiency [34][35].

Group 3: Experimental Results
- Experiments conducted on the Qwen3-30B-A3B model show that R3 consistently outperforms other methods across various metrics, indicating improved overall performance [40][41].
- Training stability has improved significantly, with R3 maintaining a smoother performance curve than traditional methods [43][46].

Group 4: Authors and Contributions
- The first author, Wenhan Ma, is a researcher at Xiaomi's LLM-Core team, while the two corresponding authors are Luo Fuli and Professor Sui Zhifang from Peking University, both of whom have made notable contributions to the field [51][56][61].
New Work from Chen Danqi: A Third Path for Large-Model Reinforcement Learning, with an 8B Model Surpassing GPT-4o
量子位· 2025-09-28 04:56
Core Viewpoint
- The article discusses a new method called RLMT (Reinforcement Learning with Model-rewarded Thinking) that combines the advantages of RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards), enabling an 8-billion-parameter model to outperform GPT-4o and rival Claude-3.7-Sonnet [1][4][11].

Group 1: Methodology and Performance
- RLMT requires the model to generate a chain of thought (CoT) before producing an answer, which is then evaluated by a reward model trained on human preferences (see the sketch after this summary) [5][17].
- The method can be applied directly to base models without supervised fine-tuning (SFT), significantly reducing post-training costs [6][22].
- In benchmark tests, the L3.1-8B-RLMT model achieved an average score of 84.3, surpassing larger models such as GPT-4o and Claude-3.7-Sonnet [7].

Group 2: Training Process
- The training process involves generating a reasoning trajectory based on user prompts, followed by scoring the final answer with a reward model [14].
- Two training approaches are highlighted: warm-start (using SFT data) and zero (direct training without SFT), both leading to improved performance [21][19].
- RLMT shifts the model's reasoning style toward human-like thought processes, resulting in higher-quality dialogue and writing [19].

Group 3: Implications and Future Directions
- The introduction of RLMT sets a new baseline for general reinforcement learning, emphasizing the importance of defining preferences in the post-training era [8].
- The results indicate that smaller models can outperform much larger ones, suggesting a shift in focus toward efficiency in model training [22].
- The research team, led by Chen Danqi, aims to further explore natural language understanding and reasoning capabilities in future studies [24][25].
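The sampling-and-scoring step can be sketched as follows, assuming a `<think>…</think>` delimiter, a callable policy sampler, and a GRPO-style group baseline; all three are assumptions for illustration rather than the paper's exact setup.

```python
"""Minimal sketch of the RLMT recipe: sample thought + answer, score only the
final answer with a preference-trained reward model, form group-relative
advantages for a standard policy-gradient update."""
from typing import Callable, List, Tuple

def rlmt_advantages(
    prompt: str,
    policy_sample: Callable[[str], str],        # returns "<think>...</think>final answer" (assumed format)
    reward_model: Callable[[str, str], float],  # human-preference-trained scorer (assumed)
    group_size: int = 8,
) -> Tuple[List[str], List[float]]:
    rollouts, rewards = [], []
    for _ in range(group_size):
        text = policy_sample(prompt)
        # The chain of thought is required, but only the final answer is scored.
        answer = text.split("</think>")[-1].strip()
        rollouts.append(text)
        rewards.append(reward_model(prompt, answer))
    mean_r = sum(rewards) / len(rewards)
    # Group-relative advantages feed an ordinary policy-gradient / GRPO-style step.
    advantages = [r - mean_r for r in rewards]
    return rollouts, advantages
```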
Large-Model Reinforcement Learning: Compared with PPO, Is DPO Still No Match?
自动驾驶之心· 2025-06-22 14:09
Core Insights
- The article discusses the theoretical and experimental shortcomings of DPO (Direct Preference Optimization) compared to PPO (Proximal Policy Optimization), highlighting that while DPO appears to lead on open-source benchmarks, top closed-source models such as GPT-4 and Claude use PPO [1][2].

DPO's Deficiencies
- DPO encounters issues similar to reward hacking: even though it lacks an explicit reward model, it can produce solutions that do not align with human preferences (the objective DPO implicitly optimizes is shown after this summary) [2].
- The theoretical analysis suggests that, given true reward signals, the policies obtained from PPO are a proper subset of those obtainable from DPO, meaning DPO may admit solutions that deviate from the reference policy [3].

Experimental Findings
- Experiments reveal that DPO can assign higher probabilities to data points not covered by the preference dataset, leading to unexpected behaviors, whereas PPO optimizes effectively under KL constraints [6].
- DPO's performance can be improved by reducing distribution drift through methods like SafeSFT, but it still does not surpass PPO [8].

Performance Metrics
- Benchmark results consistently show that PPO outperforms both iterative DPO and DPO across various tasks, particularly in programming competitions [10].
- Specific metrics indicate that models using PPO achieve significantly higher pass rates than those using DPO, with PPO models reaching up to 44.4% pass@5, while DPO models struggle to achieve meaningful results [11][12].

Conclusion
- The findings suggest that while DPO has theoretical merits, its practical value in high-stakes tasks like programming is limited compared to PPO, which continues to set new performance standards [13].
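For reference, the objective DPO optimizes is the standard published loss (Rafailov et al., 2023); the sketch below states it in a few lines, with batch shapes chosen for illustration.

```python
"""The standard DPO loss, for contrast with PPO's explicit reward + KL-constrained
on-policy optimization."""
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """All inputs are per-example summed log-probs of shape [batch].
    DPO widens the chosen-vs-rejected log-ratio relative to the frozen reference
    model, with no explicit reward model and no on-policy sampling, which is
    exactly why it can drift onto regions the preference data never covered."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example with random log-probs:
batch = (torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(dpo_loss(*batch))
```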
A New Breakthrough in Large-Model Reinforcement Learning: The New SPO Paradigm Boosts LLM Reasoning!
机器之心· 2025-06-08 08:21
Core Viewpoint
- The article discusses the potential of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs), highlighting the effectiveness of models like DeepSeek R1, Kimi K1.5, and Qwen 3 in complex reasoning tasks [1].

Current Challenges
- A fundamental challenge in effective RL is the credit assignment problem: attributing the final evaluation of an LLM's response to specific decision actions (tokens) within the sequence [2].
- The difficulty arises from sparse reward signals, which only provide clear success or failure feedback at the end of the sequence [3].

Current Methods
- In RL, advantage value estimation is commonly used to address the credit assignment problem; current methods for LLMs can be categorized into two types based on the granularity of advantage estimation [5].
- Coarse-grained trajectory-level methods, like GRPO used in DeepSeek R1, calculate a single advantage value from the final reward, which cannot reward correct parts of incorrect answers or penalize redundant parts of correct answers [6].
- Fine-grained token-level methods, such as PPO, estimate an advantage value for each token but suffer from high estimation errors, because trajectory distributions differ greatly across prompts and sampling during training is limited [6].

New SPO Framework
- The research team from the Chinese Academy of Sciences and City University of Hong Kong proposed the Segment Policy Optimization (SPO) framework to overcome these limitations [8].
- SPO employs a medium-grained, segment-level advantage estimation approach, dividing generated sequences into connected segments and calculating an advantage value for each segment (a simplified sketch follows this summary) [11].

Advantages of SPO
- Improved credit assignment: the segment-level method provides localized advantage feedback, allowing the model to reward valuable parts of incorrect answers and penalize redundant segments in correct answers [12].
- More accurate advantage estimation: the segment-level method requires fewer estimation points, effectively using Monte Carlo sampling for unbiased advantage estimation without relying on an unstable critic model [12].
- Flexibility and adaptability: segment division can be defined arbitrarily, allowing adjustment between token-level and trajectory-level granularity to suit different tasks and applications [12].

Core Components of SPO
- The SPO framework consists of three core components: a flexible segment division strategy, segment-level advantage estimation based on Monte Carlo sampling, and policy optimization using segment-level advantages [13].

Specific Instances of SPO
- The team proposed two specific instances of the SPO framework: SPO-chain for short chain-of-thought scenarios and SPO-tree for long chain-of-thought scenarios, the latter improving Monte Carlo sampling efficiency [15].

Token Probability-Mask Strategy
- A token probability-mask strategy was introduced to selectively compute the loss on low-probability tokens within segments, which are the critical decision points for segment-level advantages [16].

Experimental Results
- In short chain-of-thought scenarios, models trained with SPO achieved higher accuracy than various other training algorithms [29].
- In long chain-of-thought scenarios, SPO-tree outperformed GRPO in accuracy while using the same base model and training time [31].
- The segment division method based on cutpoints showed the best performance in short chain-of-thought scenarios compared to other methods [36].

Conclusion
- The work presents a reinforcement learning training framework, SPO, based on medium-grained, segment-level advantage values, striking a balance between token-level and trajectory-level methods, offering better credit assignment while requiring fewer estimation points [42].
- The effectiveness of the SPO framework and its instances, SPO-chain and SPO-tree, has been validated through experiments [43].
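A simplified sketch of the segment idea, assuming segments are cut at low-probability ("cutpoint") tokens and that Monte Carlo value estimates are available at segment boundaries; the threshold and estimator below are illustrative, not the exact SPO-chain/SPO-tree procedures.

```python
"""Illustrative sketch of segment-level credit assignment in the spirit of SPO."""
from typing import List

def split_into_segments(token_probs: List[float], threshold: float = 0.6) -> List[int]:
    """Return segment end indices: a new segment ends right after each
    low-probability ("cutpoint") token, i.e. where the model made an uncertain choice."""
    ends = [i for i, p in enumerate(token_probs) if p < threshold]
    if not ends or ends[-1] != len(token_probs) - 1:
        ends.append(len(token_probs) - 1)
    return ends

def segment_advantages(boundary_values: List[float]) -> List[float]:
    """Given Monte Carlo value estimates at segment boundaries (e.g. the success
    rate of continuations sampled from each prefix), a segment's advantage is the
    change in estimated value it produced."""
    return [v_next - v_prev for v_prev, v_next in zip(boundary_values, boundary_values[1:])]

# Toy example: cutpoints at positions 1 and 3 define the segments,
# and boundary values that rise, dip, then rise give per-segment credit.
print(split_into_segments([0.9, 0.4, 0.95, 0.5, 0.99]))   # [1, 3, 4]
print(segment_advantages([0.2, 0.5, 0.4, 0.9]))           # approx. [0.3, -0.1, 0.5]
```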
Qwen & Tsinghua Team Overturn Conventional Wisdom: Large-Model RL Beats Full-Token Training Using Only 20% of Key Tokens
量子位· 2025-06-05 10:28
Core Insights
- The article discusses a recent breakthrough by the LeapLab team from Tsinghua University, revealing that training on only the 20% of tokens with the highest entropy can significantly enhance reinforcement learning for large models, outperforming training on all tokens [1][6].

Group 1: Research Findings
- The team achieved new state-of-the-art (SOTA) results with the Qwen3-32B model, scoring 63.5 on AIME'24 and 56.7 on AIME'25, the highest scores for models under 600 billion parameters trained directly from the base model [2].
- Extending the maximum response length from 20k to 29k raised the AIME'24 score to 68.1 [4].
- The research challenges the classic Pareto principle: in large-model reinforcement learning, the 80% of low-entropy tokens can be discarded without harm, and keeping them may even hurt training [5][6].

Group 2: Token Analysis
- The study reveals a distinctive entropy distribution during chain-of-thought reasoning: over 50% of tokens have an entropy below 0.01, while only 20% exceed 0.672 [9][10].
- High-entropy tokens act as "logical connectors" in reasoning, while low-entropy tokens are mostly deterministic components, such as affixes or parts of mathematical expressions [11].
- Experiments show that raising the sampling temperature of high-entropy tokens improves reasoning performance, while lowering it degrades performance, underscoring the importance of maintaining high entropy at critical positions [13].

Group 3: Training Methodology
- By restricting reinforcement learning training to the top 20% of high-entropy tokens (see the sketch after this summary), the Qwen3-32B model saw significant gains: AIME'24 improved by 7.71 points and AIME'25 by 11.04 points, with average response length increasing by roughly 1378 tokens [15][17].
- Similar improvements were observed on the Qwen3-14B model, while the Qwen3-8B model maintained stable performance [16].
- Conversely, training on the 80% of low-entropy tokens caused a sharp decline in performance, indicating their minimal contribution to reasoning capability [18].

Group 4: Implications and Generalization
- The findings suggest that high-entropy tokens enable exploration of different reasoning paths, while low-entropy tokens, being largely deterministic, may restrict such exploration [20].
- The advantage of training on high-entropy tokens grows with model size, with the 32B model showing the largest improvement [22].
- Models trained on high-entropy tokens also performed well on out-of-domain tasks, suggesting a link between high-entropy tokens and generalization capability [22].

Group 5: Reinforcement Learning Insights
- The research indicates that reinforcement learning with verifiable rewards (RLVR) does not completely overhaul the base model but rather fine-tunes it, with 86.67% of high-entropy token positions still overlapping even after extensive training [24][25].
- Tokens with higher initial entropy see larger entropy increases during RLVR training, while low-entropy tokens remain largely unchanged [25].
- The article suggests that high-entropy tokens may explain why reinforcement learning generalizes better than supervised fine-tuning, which tends toward memorization and overfitting [26][27].
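A sketch of how the 20% selection can be wired into a token-level policy-gradient loss; the surrogate loss form and the per-batch 80th-percentile threshold are assumptions for illustration, not the team's exact recipe.

```python
"""Sketch: restrict the policy-gradient loss to the top-20% highest-entropy tokens."""
import torch

def high_entropy_mask(logits: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """logits: [tokens, vocab]. Boolean mask keeping only the tokens whose
    predictive entropy is in the top `keep_ratio` of the batch."""
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(-1)                 # per-token entropy, [tokens]
    threshold = torch.quantile(entropy, 1.0 - keep_ratio)
    return entropy >= threshold

def masked_pg_loss(logits: torch.Tensor, token_ids: torch.Tensor,
                   advantages: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """Token-level policy-gradient surrogate computed only on high-entropy tokens;
    the remaining ~80% of tokens contribute no gradient."""
    mask = high_entropy_mask(logits, keep_ratio)
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    loss = -(advantages * token_logp)[mask]
    return loss.mean() if mask.any() else loss.sum()

# Toy usage with random tensors:
logits = torch.randn(10, 50)
loss = masked_pg_loss(logits, torch.randint(0, 50, (10,)), torch.randn(10))
print(loss)
```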
10 Lines of Code, a 15% Gain on AIME24/25! Unveiling the Entropy Mechanism in Large-Model Reinforcement Learning
机器之心· 2025-06-05 07:14
Core Insights
- The article discusses the entropy collapse problem in reinforcement learning for large language models (LLMs) and proposes solutions to preserve exploration capability during training [3][5][24].

Group 1: Entropy Collapse in Reinforcement Learning
- The core challenge in reinforcement learning is the trade-off between exploitation and exploration, where policy entropy is a key indicator of exploration potential [4].
- A significant finding is that policy entropy rapidly decreases to near zero within a few training steps, indicating a loss of exploration ability and leading to performance stagnation [4][5].
- The relationship between policy entropy and downstream performance is analyzed quantitatively, revealing that, in the absence of entropy interventions, performance is essentially determined by policy entropy [4][5].

Group 2: Mechanisms Behind Entropy Changes
- The study identifies the driving factor behind changes in policy entropy during reinforcement learning: the covariance between action probabilities and their corresponding advantages [5][13].
- High-advantage, high-probability actions reduce policy entropy, while rare high-advantage actions increase it [13][17].

Group 3: Proposed Solutions for Enhancing Entropy
- The article introduces two simple yet effective entropy-enhancing reinforcement learning strategies, Clip-Cov and KL-Cov, which can be implemented with minimal code changes (a covariance-based sketch follows this summary) [5][22].
- Experimental results show that these methods significantly improve performance, achieving a 6.4% gain on Qwen2.5-32B and up to 15% on challenging datasets such as AIME24/25 [22][24].
- The research emphasizes that maintaining exploration capability is essential for scalable reinforcement learning, suggesting that merely increasing computational power yields limited benefits unless the entropy bottleneck is addressed [7][24].
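A covariance-based sketch in the spirit of Clip-Cov: drop the gradient on the small fraction of tokens whose centred (log-probability, advantage) product is largest, since those are the terms that drive entropy down fastest. The clip fraction and the per-token statistic are illustrative assumptions, not the paper's exact formulation.

```python
"""Sketch of a Clip-Cov-style mask over the tokens with the largest covariance terms."""
import torch

def clip_cov_mask(token_logp: torch.Tensor, advantages: torch.Tensor,
                  clip_frac: float = 0.002) -> torch.Tensor:
    """token_logp, advantages: [tokens]. Returns a mask that is False for the
    small fraction of tokens with the largest positive centred product
    (logp - mean_logp) * (adv - mean_adv), i.e. the main entropy-reducing terms."""
    cov_term = (token_logp - token_logp.mean()) * (advantages - advantages.mean())
    n_clip = max(1, int(clip_frac * cov_term.numel()))
    clipped = torch.topk(cov_term, n_clip).indices
    mask = torch.ones_like(cov_term, dtype=torch.bool)
    mask[clipped] = False
    return mask

# Usage inside a PPO/GRPO-style loss: zero the gradient on the clipped tokens.
logp, adv = torch.randn(1000), torch.randn(1000)
mask = clip_cov_mask(logp, adv)
loss = -(adv * logp)[mask].mean()
print(loss)
```

The KL-Cov variant described in the article applies a KL penalty to those same high-covariance tokens instead of dropping their gradient; the selection step is the shared ingredient.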
Large-Model RL Isn't Just Math and Code! A 7B Reward Model Handles Medicine, Law, Economics and Every Subject, and Works Even Without Chain-of-Thought
量子位· 2025-04-02 07:40
Core Insights
- The article discusses a framework from Tencent and Soochow University that extends RLVR (reinforcement learning with verifiable rewards) training beyond mathematics and coding to disciplines including medicine, chemistry, law, psychology, and economics [3][4].

Group 1: Framework and Methodology
- The framework uses a model-based soft reward (see the sketch after this summary), which shows significant improvements in generalization, robustness, and scalability compared with traditional binary rule-based rewards [4].
- The research builds on the observation that when tasks have objective reference answers, different large language models show high consistency in binary (correct/incorrect) judgments [7].
- The team distilled a 7B reward model from a 72B-parameter model (Qwen2.5-Instruct) without domain-specific annotations, relying solely on data collected during the online exploration phase [9].

Group 2: Experimental Results
- The study sampled 6,000 questions from ExamQA, covering a wide range of subjects across science, engineering, and the humanities [12].
- The RM-7B model demonstrated superior performance on free-form answer tasks compared with various baselines, including base models, fine-tuned models, and rule-based reinforcement learning [14].
- The RM-7B model achieved an average score of 62.5 on multi-subject tasks, outperforming other methods in both binary-reward and soft-reward settings [15].

Group 3: Scalability and Future Research
- The research indicates that model-based rewards scale better as data volume increases, suggesting a more effective approach for handling unstructured reference answers [18].
- The authors note that while chain-of-thought reasoning (CoT) is beneficial in many scenarios, whether it is necessary for judging semantic equivalence between reference answers and model responses remains an open question [16].
- The study imposes no format constraints on reference answers or model responses, which reduces the labor of data standardization, but the role of format-related constraints and rewards needs further examination [17].
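A sketch of what a model-based soft reward can look like: a generative judge is asked whether the response matches the reference answer, and the probability it places on "Yes" is used directly as the reward. The prompt template and the single-token readout are assumptions for illustration, not the paper's exact scoring setup.

```python
"""Sketch of a model-based soft reward derived from a generative judge model."""
from typing import Callable

def soft_reward(question: str, reference: str, response: str,
                yes_probability: Callable[[str], float]) -> float:
    """yes_probability(prompt) is assumed to return the judge model's probability
    of answering "Yes" (e.g. softmax over the "Yes"/"No" next-token logits of a
    7B reward model)."""
    prompt = (
        "Question: {q}\nReference answer: {a}\nCandidate response: {r}\n"
        "Is the candidate response equivalent to the reference answer? "
        "Answer Yes or No."
    ).format(q=question, a=reference, r=response)
    return yes_probability(prompt)   # in [0, 1]; continuous, unlike a 0/1 rule-based reward

# A hard (binary) baseline for comparison would simply round this value:
# hard = 1.0 if soft_reward(...) >= 0.5 else 0.0
```

Using the judge's probability rather than its rounded verdict is what makes the reward "soft": partially correct free-form answers receive partial credit instead of an all-or-nothing signal.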