GRPO
Challenging GRPO: NVIDIA proposes GDPO, built specifically for multi-reward optimization
机器之心· 2026-01-11 04:00
GRPO is one of the foundational techniques behind DeepSeek-R1's success. Over the past year or two, GRPO and its variants have become widely adopted reinforcement learning algorithms in the industry thanks to their efficiency and simplicity.

But as language models keep improving, users' expectations are shifting: models should not only answer correctly, but also behave in line with diverse human preferences across many different scenarios. To that end, reinforcement learning training pipelines have begun to introduce multiple reward signals, each corresponding to a different preference, to jointly steer the model toward the desired behavior.

A new paper from NVIDIA, however, argues that GRPO may not be the best choice for multi-reward optimization. Specifically, in multi-reward settings GRPO normalizes different reward combinations into the same advantage value, which weakens the training signal and lowers the achievable reward.

To address this, the authors propose a new policy optimization method, Group reward-Decoupled Normalization Policy Optimization (GDPO). By normalizing each reward signal separately, GDPO avoids different rewards being mixed together and "flattened out", preserving their relative differences more faithfully, making multi-reward optimization more accurate, and markedly improving training stability (a schematic comparison of the two normalization schemes is sketched below).

Paper title: GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-re ...
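The paper itself is only excerpted above, but the failure mode it describes, coupled normalization washing out individual reward signals, can be illustrated numerically. The following is a minimal sketch with made-up rewards and my own function names (not from the paper): a GRPO-style scheme sums the reward channels and normalizes the combined scalar within the group, while a GDPO-style scheme normalizes each channel separately before combining.

```python
import numpy as np

# Hypothetical group of 4 rollouts for one prompt, scored by two reward signals
# (say, correctness in {0, 1} and a style score on a much smaller scale).
# All numbers are made up purely for illustration.
rewards = np.array([
    [1.0, 0.02],
    [1.0, 0.10],
    [0.0, 0.02],
    [0.0, 0.10],
])  # shape: (group_size, num_rewards)

def coupled_advantage(r):
    # GRPO-style: sum the reward channels first, then normalize the combined
    # scalar within the group. The small-scale channel is almost entirely
    # washed out by the large-scale one.
    combined = r.sum(axis=1)
    return (combined - combined.mean()) / (combined.std() + 1e-8)

def decoupled_advantage(r):
    # GDPO-style (as described above): normalize each reward channel
    # separately within the group, then combine. Relative differences inside
    # every channel survive, regardless of its scale.
    per_channel = (r - r.mean(axis=0)) / (r.std(axis=0) + 1e-8)
    return per_channel.mean(axis=1)

print("coupled   :", np.round(coupled_advantage(rewards), 3))
print("decoupled :", np.round(decoupled_advantage(rewards), 3))
```

With these toy numbers the coupled advantages are almost entirely determined by the correctness channel, while the decoupled version gives equal weight to the relative differences within each channel.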
NeurIPS 25 high-scoring paper | Reinforcing reasoning LLMs with discriminative supervised learning, solving difficulty bias and entropy collapse
机器之心· 2025-10-26 07:00
Core Insights
- The article discusses the introduction of a novel framework called Discriminative Constrained Optimization (DisCO) aimed at enhancing large reasoning models (LRMs) by addressing inherent limitations of the Group Relative Policy Optimization (GRPO) method, particularly in binary reward settings [3][4][6][32].

Summary by Sections

Introduction to DisCO
- DisCO is proposed as a solution to the difficulty bias and entropy instability issues found in GRPO and its variants, allowing for the integration of advanced discriminative learning techniques to tackle data imbalance problems [4][6][32].

Advantages of DisCO
- DisCO significantly outperforms GRPO and its improved versions, achieving an average gain of 7% over GRPO and 6% over DAPO across six benchmark tasks with a 1.5 billion parameter model [4][22].
- Notably, DisCO with a maximum response length of 8k outperforms GRPO with a maximum response length of 32k [4].

Methodology
- The framework eliminates difficulty bias by adopting a discriminative optimization objective, which maximizes the score of correct answers while minimizing that of incorrect ones (a toy sketch of this idea follows this summary) [6][11].
- It employs non-clipped scoring functions and a constrained optimization approach to stabilize training dynamics, addressing issues of entropy instability [6][19][28].

Experimental Results
- DisCO consistently demonstrates superior performance across various models, including a 3.5% improvement over GRPO in 7 billion parameter experiments [22].
- The training dynamics of DisCO show a steady increase in training rewards and stable generation entropy, contrasting with the instability observed in GRPO and its variants [27][28].

Ablation Studies
- The analysis of individual components within DisCO reveals that each component contributes significantly to its overall performance, with the use of non-clipped scoring functions being particularly critical [30].

Future Directions
- While the current focus is on binary rewards, the authors suggest that future research could explore the application of DisCO to non-binary reward scenarios, potentially utilizing novel scoring functions from supervised learning [32].
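The summary above describes DisCO's core idea (maximize the scores of correct answers, minimize those of incorrect ones, with a constraint keeping training stable) but not its exact formulas. Below is a heavily simplified PyTorch sketch of that general discriminative pattern; the scoring function, the squared-drift penalty, and all constants are my own illustrative assumptions, not the DisCO objective itself.

```python
import torch

def disco_style_loss(scores, labels, scores_old, delta=0.01, lam=10.0):
    """Simplified discriminative objective in the spirit of DisCO:
    push the scores of correct responses above those of incorrect ones,
    with a soft penalty that keeps the current scores close to the
    old-policy scores (a stand-in for DisCO's constrained optimization).
    `scores` would typically be non-clipped log-likelihood scores of each
    sampled response under the current policy; everything here is a sketch."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Discriminative term: separate correct from incorrect responses.
    objective = pos.mean() - neg.mean()
    # Soft constraint: penalize drifting too far from the sampling policy.
    drift = ((scores - scores_old) ** 2).mean()
    penalty = lam * torch.clamp(drift - delta, min=0.0)
    return -objective + penalty

# Toy usage: 6 sampled responses for one question, 1 = correct, 0 = incorrect.
scores = torch.tensor([-1.2, -0.8, -2.0, -1.5, -0.9, -1.7], requires_grad=True)
scores_old = scores.detach().clone()
labels = torch.tensor([1, 1, 0, 0, 1, 0])
loss = disco_style_loss(scores, labels, scores_old)
loss.backward()
print(loss.item(), scores.grad)
```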
Training-Free GRPO, viewed by 630K people on X: moving GRPO learning into the context space
机器之心· 2025-10-22 08:46
Core Viewpoint
- The article discusses the introduction of Training-Free Group Relative Policy Optimization (GRPO), a method that allows for reinforcement learning (RL) without the need to update model parameters, making it more accessible and cost-effective for developers and smaller teams [4][20][28].

Summary by Sections

GRPO Overview
- GRPO has gained popularity in large model reinforcement learning, particularly for tasks like mathematical reasoning and multi-agent collaboration [2].
- The core mechanism of GRPO involves "multi-path parallelism + group advantage," which, while powerful, is costly in terms of model parameter optimization [3].

Training-Free GRPO
- Tencent Youtu's recent paper proposes a solution to the high costs of parameter updates by moving the GRPO learning process into the context space, allowing for multiple answer paths to be generated and evaluated without changing model parameters [4][6].
- The method involves generating multiple rollout paths for the same problem, scoring them, and using the advantage signals to refine the model's preferences for high-quality solutions (a schematic loop is sketched after this summary) [4][10].

Experimental Results
- In mathematical reasoning tasks, Training-Free GRPO can enhance performance using only 100 training samples at a cost of approximately $8 to $18 on a 671 billion parameter model [13][24].
- The method shows significant improvements in performance metrics, such as a 4.6% increase in Pass@1 in web search scenarios without updating model parameters [17][18].

Advantages of Training-Free GRPO
- The approach retains the advantages of GRPO, including multi-path exploration and independent training/testing sets, while drastically reducing costs by eliminating the need for parameter updates [20][21].
- It allows for better generalization across different tasks without the complexity and maintenance costs associated with multiple specialized models [25].

Conclusion
- Training-Free GRPO represents a shift in the understanding of reinforcement learning, demonstrating that effective RL can be achieved without traditional parameter updates, making it a viable option for developers with limited resources [26][28].
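Based on the description above, the Training-Free GRPO loop keeps the model frozen and treats a growing library of natural-language "experiences" in the prompt as the thing being optimized. The sketch below shows that loop in Python; `llm` and `reward` are hypothetical placeholders, and the prompt templates are mine rather than Tencent Youtu's.

```python
def llm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder for a call to a frozen LLM API; swap in a real client."""
    raise NotImplementedError

def reward(question: str, answer: str) -> float:
    """Placeholder task-specific scorer (e.g., exact match on math answers)."""
    raise NotImplementedError

def training_free_grpo(questions, group_size=4, epochs=3):
    """Minimal sketch of the Training-Free GRPO loop described above:
    the 'policy update' is replaced by editing a shared experience library
    kept in the prompt, while the model weights stay frozen."""
    experiences: list[str] = []          # the 'parameters' live here, as text
    for _ in range(epochs):
        for q in questions:
            context = "Useful experiences:\n" + "\n".join(experiences)
            # Multi-path rollout: sample a group of answers for the same question.
            group = [llm(f"{context}\n\nQuestion: {q}\nAnswer:") for _ in range(group_size)]
            scores = [reward(q, a) for a in group]
            if max(scores) == min(scores):
                continue                  # no advantage signal in this group
            best = group[scores.index(max(scores))]
            worst = group[scores.index(min(scores))]
            # 'Policy update' in context space: distill the group advantage
            # into a short natural-language lesson and store it.
            lesson = llm(
                "Compare the better and worse answers below and state, in one "
                f"sentence, a reusable lesson.\nBetter: {best}\nWorse: {worst}"
            )
            experiences.append(lesson)
    return experiences
```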
NeurIPS 25 | An upgraded GRPO is here: GVPO reshapes the post-training paradigm for large models
机器之心· 2025-10-14 02:06
Core Viewpoint
- Post-training of large models is becoming a key aspect of AI evolution, focusing on enhancing reasoning capabilities, aligning with human preferences, and maintaining stability and efficiency [1].

Summary by Sections

GVPO Introduction
- The team from Zuoyebang and Hong Kong University of Science and Technology proposed a new method called GVPO (Group Variance Policy Optimization) to address the instability issues of GRPO (Group Relative Policy Optimization) [2].

Design Motivation
- Inspired by DPO (Direct Preference Optimization), the research team aims to maximize rewards under KL constraints in the GRPO scenario, which involves multiple samplings for each prompt [5].

Practical Challenges
- A significant challenge is the expectation calculation of Z(x) across all possible samples, which is nearly impractical. The team found that ensuring the sum of gradient weights for all samples under the same prompt equals zero allows Z(x) to cancel out, thus avoiding this computational difficulty [6].

Key Advantages of GVPO
1. **Unique Optimal Solution Guarantee**: GVPO's MSE form provides a strict mathematical proof that it achieves a unique optimal solution when R_θ equals R, ensuring algorithm effectiveness and stability [13].
2. **No Need for Importance Sampling**: GVPO's optimal solution has minimal restrictions on sampling distribution, allowing for off-policy training without the common instability issues associated with importance sampling [14].

Analytical Perspectives
- GVPO can be understood from three complementary analytical perspectives, each corresponding to an equivalent loss function:
  1. **Negative Log-Likelihood Perspective (NLL)**: GVPO's loss function can be viewed as a weighted negative log-likelihood, allowing for flexible integration of historical and heterogeneous data sources [17].
  2. **Mean Squared Error Perspective (MSE)**: The optimization goal is to minimize the deviation between implicit and actual rewards, ensuring convergence to a unique global optimal solution under KL constraints (a minimal sketch of this view follows this summary) [18].
  3. **Reinforcement Learning Perspective (RL)**: This perspective highlights the three components of the GVPO loss function, emphasizing the balance between actual and predicted rewards [19].

Experimental Results
- In mathematical reasoning tasks, GVPO outperformed GRPO and its improved version Dr.GRPO across five benchmark tests, significantly enhancing the base model's performance [21].
- Ablation studies indicate GVPO's insensitivity to hyperparameter β and its excellent scalability with increased sampling numbers, allowing smaller models to match larger ones [23].

Significance and Future Prospects
- GVPO represents a paradigm shift in post-training, moving from experience-driven approaches to those with theoretical guarantees, enhancing stability, flexibility, and efficiency in large model training [25][26].
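As a rough illustration of the MSE perspective summarized above, the sketch below treats β·(log π_θ - log π_ref) as the implicit reward, centers both the implicit and the actual rewards within the sampled group (which is what lets the intractable Z(x) term cancel), and minimizes the squared gap between the two. The exact weighting and scaling in GVPO follow the paper, which is only summarized here; this is a minimal, assumption-laden rendering of the idea.

```python
import torch

def gvpo_style_loss(logp_theta, logp_ref, rewards, beta=0.1):
    """Sketch of the MSE view of GVPO described above.
    logp_theta, logp_ref: (group_size,) sequence log-probs of the sampled
    responses under the current policy and the reference policy.
    rewards: (group_size,) actual rewards for the same responses.
    Both the implicit reward beta*(logp_theta - logp_ref) and the actual
    reward are centered within the group; the centering is what makes the
    log-partition term Z(x) drop out (gradient weights sum to zero per prompt)."""
    implicit = beta * (logp_theta - logp_ref)
    implicit_c = implicit - implicit.mean()
    reward_c = rewards - rewards.mean()
    return ((implicit_c - reward_c) ** 2).mean()

# Toy usage with one prompt and a group of 4 sampled responses.
logp_theta = torch.tensor([-35.0, -40.0, -38.0, -42.0], requires_grad=True)
logp_ref = torch.tensor([-36.0, -39.0, -37.0, -41.0])
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = gvpo_style_loss(logp_theta, logp_ref, rewards)
loss.backward()
print(loss.item(), logp_theta.grad)
```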
Not black magic! HKUST, Tsinghua and others team up to pry open the reasoning black box: RL makes AI think like humans
具身智能之心· 2025-10-10 00:02
Core Insights
- The article discusses the recent research by teams from Hong Kong University of Science and Technology, University of Waterloo, and Tsinghua University, which reveals that large language models (LLMs) learn reasoning in a human-like manner by separating high-level strategy planning from low-level execution [3][10][12].

Group 1: Reinforcement Learning and LLMs
- Reinforcement Learning (RL) enhances the reasoning capabilities of LLMs, although the underlying mechanisms have not been clearly understood until now [2][5].
- The research highlights the importance of RL in enabling models to exhibit reflective behaviors during interactions with the RL environment [7][10].
- Two significant experimental clues are identified: "length scaling effect" and "aha moment," indicating that LLMs can learn to use more thinking time to solve reasoning tasks [8][9][10].

Group 2: Learning Dynamics
- The study outlines a two-phase learning dynamic in LLMs during RL training: the first phase focuses on consolidating basic execution skills, while the second phase shifts towards exploring high-level planning strategies [14][22].
- In the first phase, the model's focus is on mastering low-level operations, which is marked by a decrease in the uncertainty of execution tokens [23][24].
- The second phase involves the model actively expanding its strategy planning library, which correlates with improved reasoning accuracy and longer solution chains [28][30].

Group 3: HICRA Algorithm
- The research introduces a new algorithm called HICRA (Hierarchy-Aware Credit Assignment), which emphasizes the learning of planning tokens over execution tokens to enhance reasoning capabilities (a toy sketch of the weighting idea follows this summary) [18][42].
- HICRA consistently outperforms mainstream methods like GRPO, particularly when the model has a solid foundation in execution skills [20][45].
- Experimental results show that HICRA leads to significant improvements in various reasoning benchmarks compared to GRPO, indicating its effectiveness in optimizing planning tokens [46][47].

Group 4: Insights on Token Dynamics
- The study reveals that the observed phenomena, such as "aha moments" and "length scaling," are not random but are indicative of a structured learning process [33][35].
- The overall token-level entropy decreases as the model becomes more predictable in executing low-level tasks, while the semantic entropy of planning tokens increases, reflecting the model's exploration of new strategies [39][40].
- The findings suggest that the key to enhancing reasoning capabilities lies in improving planning abilities rather than merely optimizing execution details [20][41].
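The article states that HICRA concentrates credit on planning tokens rather than execution tokens, but the summary does not spell out the exact rule. The sketch below shows one plausible reading: start from a GRPO-style per-token advantage and amplify it on tokens flagged as planning tokens. The flagging mechanism, the multiplicative form, and `alpha` are all illustrative assumptions, not the paper's exact formula.

```python
import torch

def hicra_style_advantages(token_advantages, is_planning, alpha=0.5):
    """Sketch of hierarchy-aware credit assignment (HICRA-like):
    amplify the advantage on tokens flagged as high-level planning tokens
    and leave execution tokens unchanged. `alpha` and the multiplicative
    form are illustrative choices."""
    return token_advantages * (1.0 + alpha * is_planning.float())

# Toy usage: one response of 6 tokens sharing a GRPO-style scalar advantage,
# with tokens 0 and 3 (hypothetically) tagged as planning tokens.
adv = torch.full((6,), 0.8)
is_planning = torch.tensor([1, 0, 0, 1, 0, 0], dtype=torch.bool)
print(hicra_style_advantages(adv, is_planning))
```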
Explainer: deconstructing large-model post-training in one article, the past and present of GRPO and its successors
36Ke· 2025-09-01 04:38
Group 1
- The core concept of the article revolves around the evolution of post-training methods in large language models, particularly focusing on the GRPO algorithm as a significant advancement in reinforcement learning paradigms [2][46].
- GRPO has emerged as a universal reinforcement learning algorithm applicable to a wide range of post-training tasks, with notable improvements over previous methods like PPO [2][48].
- The article discusses the importance of post-training in enhancing the adaptability and flexibility of models, addressing the limitations of pre-training alone [5][46].

Group 2
- The article highlights the transition from PPO to GRPO, emphasizing the reduction of computational costs and memory requirements, making GRPO a more efficient alternative [18][14].
- GRPO's methodology involves using historical performance data to establish a baseline for advantage estimation, eliminating the need for a separate value function (a minimal sketch of the group-baseline update follows this summary) [16][14].
- Despite its advantages, GRPO still faces stability issues, prompting further research and development of improved algorithms like DAPO and GSPO [19][48].

Group 3
- DAPO, developed by ByteDance and Tsinghua AIR, builds upon GRPO by introducing enhancements such as Clip-Higher and dynamic sampling to improve training efficiency [20][21].
- GSPO represents a significant advancement by shifting the focus from token-level to sequence-level importance sampling, which enhances training stability [28][30].
- GFPO addresses the limitations of GRPO by allowing for the simultaneous optimization of multiple response attributes, thus improving the overall performance of models [33][34].
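The group-baseline step mentioned above is the heart of GRPO and fits in a few lines. The sketch below is a minimal rendering: advantages come from normalizing rewards within the sampled group (no learned value model), and the update reuses PPO's clipped surrogate at the response level. Token-level details and the KL term are omitted, and all numbers are toy values, so this is a schematic rather than a faithful implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Minimal GRPO-style surrogate: PPO's clipped objective, but with the
    advantage coming from a group baseline (mean/std of rewards within the
    group of responses for one prompt) instead of a learned value model.
    logp_new, logp_old: (group_size,) summed log-probs of each sampled
    response; rewards: (group_size,) scalar rewards."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

# Toy usage: a group of 4 responses for one prompt.
logp_old = torch.tensor([-30.0, -28.0, -32.0, -29.0])
delta = torch.tensor([0.1, -0.2, 0.05, 0.3], requires_grad=True)
logp_new = logp_old + delta
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
loss = grpo_loss(logp_new, logp_old, rewards)
loss.backward()
print(loss.item(), delta.grad)
```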
Explainer: deconstructing large-model post-training in one article, the past and present of GRPO and its successors
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article discusses the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm in the context of large language models and reinforcement learning, highlighting its advantages and limitations compared to previous methods like Proximal Policy Optimization (PPO) [4][38].

Summary by Sections

Development of Large Language Models
- The rapid advancement of large language models has led to the emergence of various post-training methods, with GRPO being a notable innovation that enhances reinforcement learning paradigms [3][5].

Post-Training and Reinforcement Learning
- Post-training is crucial for refining models' capabilities in specific domains, enhancing adaptability and flexibility to meet diverse application needs [12][11].
- Reinforcement learning, particularly through human feedback (RLHF), plays a vital role in the post-training phase, aiming to optimize model outputs based on user preferences [14][19].

GRPO and Its Advantages
- GRPO eliminates the need for a separate critic model, reducing memory and computational costs significantly compared to PPO, which requires dual networks [30][35].
- The GRPO framework utilizes historical performance data to establish a baseline for evaluating model improvements, thus simplifying the training process [34][35].

Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory requirements and training speed, making it a more efficient choice for large language model training [37].
- Despite its advantages, GRPO still faces stability issues similar to those of PPO, particularly in smaller-scale reinforcement learning tasks [39].

Recent Innovations: DAPO, GSPO, and GFPO
- DAPO introduces enhancements to GRPO, such as Clip-Higher and dynamic sampling, to address practical challenges encountered during training (both tweaks are sketched after this summary) [41][42].
- GSPO advances the methodology by shifting the focus from token-level to sequence-level importance sampling, significantly improving training stability [48][49].
- GFPO allows for simultaneous optimization of multiple response attributes, addressing limitations of GRPO related to scalar feedback and multi-round reasoning tasks [61][63].

Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, illustrates a clear trajectory in optimizing large language models, with GRPO serving as a pivotal point for further advancements in the field [81][82].
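The two DAPO additions named above can be sketched on top of a GRPO-style update. In the sketch below, Clip-Higher is an asymmetric clipping range with a looser upper bound (so low-probability but promising tokens are not capped too early), and dynamic sampling drops prompt groups whose rewards are all identical because they yield zero group-relative advantage. The epsilon values are the commonly cited DAPO settings, used here only as examples, and the rest is my own toy scaffolding.

```python
import torch

def clip_higher_surrogate(ratio, adv, eps_low=0.2, eps_high=0.28):
    """DAPO-style Clip-Higher: asymmetric clipping with a looser upper bound."""
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
    return -torch.min(ratio * adv, clipped * adv).mean()

def dynamic_sampling_filter(groups):
    """DAPO-style dynamic sampling: discard prompt groups whose rewards are
    all equal (all-correct or all-wrong), since their group-relative
    advantage is zero and they contribute no gradient."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

# Toy usage.
ratio = torch.tensor([0.7, 1.1, 1.5])
adv = torch.tensor([1.0, -0.5, 0.8])
print(clip_higher_surrogate(ratio, adv))

groups = [
    {"prompt": "p1", "rewards": [1, 1, 1, 1]},   # filtered out
    {"prompt": "p2", "rewards": [1, 0, 1, 0]},   # kept
]
print([g["prompt"] for g in dynamic_sampling_filter(groups)])
```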
Verbose responses cut by 80%: DeepSeek's GRPO gets a disruptive improvement as Microsoft's GFPO debuts
机器之心· 2025-08-14 04:57
Core Viewpoint
- The article discusses the introduction of a new reinforcement learning algorithm called Group Filtered Policy Optimization (GFPO), which aims to enhance the efficiency of reasoning models by significantly reducing unnecessary token lengths during inference while maintaining accuracy [2][3][9].

Summary by Sections

Introduction to GFPO
- GFPO is a revolutionary algorithm that balances computational costs during training and testing phases, achieving up to an 80% reduction in token length during inference [3][5].

Background on GRPO
- The article explains the Group Relative Policy Optimization (GRPO) as a simplified version of the Proximal Policy Optimization (PPO) algorithm, which does not require a value model for baseline advantage estimation [7][8].
- GRPO has limitations due to its reliance on a single scalar reward signal, making it challenging to optimize multiple response attributes simultaneously, leading to increased response lengths [8][9].

Mechanism of GFPO
- GFPO allows targeted strategy optimization for desired response attributes by sampling a larger candidate response group and filtering based on specific characteristics (a toy sketch of this filtering step follows this summary) [11].
- The algorithm normalizes the advantages of selected responses using their average and standard deviation, ensuring that only the most relevant responses are considered for policy updates [13][14].

Adaptive Difficulty in GFPO
- An adaptive variant of GFPO is introduced, which allocates more training signals to harder problems, dynamically adjusting the number of retained responses based on problem difficulty [21][22].

Experimental Findings
- The article presents various experimental findings, including:
  - The importance of sampling more responses to reduce response lengths effectively [28].
  - Token efficiency optimization leads to significant length reductions while maintaining accuracy, with reductions of 70.9% to 84.6% across different benchmarks [31].
  - GFPO effectively mitigates out-of-distribution length inflation while slightly improving accuracy [32].
  - The adaptive difficulty variant outperforms the Shortest-k algorithm in length reduction across multiple benchmarks [31][40].

Conclusion
- GFPO demonstrates a substantial reduction in unnecessary response lengths during reasoning and validation phases, achieving a 94.4% reduction in excess length for answers and a 66.7% reduction for validation steps in specific benchmarks [44].
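The filtering mechanism described above can be made concrete with a small sketch: sample a larger group, rank it by the attribute being optimized (here reward per token, a simple stand-in for token efficiency), keep the top k, normalize advantages over the kept subset only, and zero out the rest. The metric, `k`, and the numbers are illustrative assumptions, not the exact GFPO configuration.

```python
import numpy as np

def gfpo_style_advantages(rewards, lengths, k=4, eps=1e-8):
    """Sketch of Group Filtered Policy Optimization as described above:
    from a larger sampled group, retain the k responses that best satisfy
    the desired attribute (here: reward per token), compute normalized
    advantages over the retained subset, and give filtered-out responses
    zero advantage so they do not contribute to the policy update."""
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    efficiency = rewards / lengths
    keep = np.argsort(efficiency)[-k:]           # indices of retained responses
    adv = np.zeros_like(rewards)
    kept_r = rewards[keep]
    adv[keep] = (kept_r - kept_r.mean()) / (kept_r.std() + eps)
    return adv

# Toy usage: 8 sampled responses; correct ones (reward 1) of varying length.
rewards = [1, 0, 0, 1, 0, 1, 0, 0]
lengths = [900, 300, 500, 1200, 700, 400, 2500, 350]
print(np.round(gfpo_style_advantages(rewards, lengths), 2))
```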
Does DeepSeek's GRPO cause model collapse? A look at Qwen3's new paradigm, GSPO
机器之心· 2025-08-07 09:42
Core Viewpoint
- The article discusses the evolution of reinforcement learning techniques in the post-training phase of large language models (LLMs), highlighting the introduction of Group Sequence Policy Optimization (GSPO) as a solution to the instability issues associated with Group Relative Policy Optimization (GRPO) [2][10][31].

Group 1: Training Phases and Techniques
- The training of large language models typically consists of two phases: pre-training and post-training, where the latter focuses on improving the model's understanding and execution of human instructions [1].
- The post-training phase employs reinforcement learning, with initial methods like Reinforcement Learning from Human Feedback (RLHF) being time-consuming and costly due to reliance on human annotators [2][3].

Group 2: Innovations and Comparisons
- DeepSeek introduced an automated approach to RLHF, significantly reducing costs and improving efficiency by allowing the model to learn through reward signals rather than manual evaluations [2].
- The DeepSeek team proposed the Group Relative Policy Optimization (GRPO) algorithm, which they believe is more effective than the Proximal Policy Optimization (PPO) used by OpenAI in ChatGPT [3][5].

Group 3: Issues with GRPO
- The Qwen team identified serious stability issues with GRPO, particularly due to its reliance on token-level importance sampling, which can lead to high variance and training instability [10][11][12].
- The instability arises from the incorrect application of importance sampling weights at the token level, which can accumulate high variance in long sequences, exacerbating the training challenges [15][16][17].

Group 4: Introduction of GSPO
- To address the issues with GRPO, the Qwen team proposed the Group Sequence Policy Optimization (GSPO), which utilizes sequence-level importance sampling to enhance training stability (the token-level vs. sequence-level contrast is sketched after this summary) [10][22][31].
- GSPO's design mitigates the accumulation of variance seen in token-level sampling, leading to improved training efficiency and stability [23][24].

Group 5: Experimental Evidence and Advantages
- Experimental results demonstrated that GSPO outperformed GRPO in various tasks, showcasing better scalability and efficiency in training [20][30].
- The Qwen team highlighted that GSPO simplifies the training of Mixture-of-Experts (MoE) models by eliminating the need for auxiliary strategies like Routing Replay, which were necessary for GRPO to achieve stable convergence [25][27][30].
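The token-level versus sequence-level distinction discussed above is easiest to see side by side. In the sketch below, the GRPO-style path produces one importance ratio per token, while the GSPO-style path collapses the whole response into a single length-normalized ratio (the geometric mean of the per-token ratios). The numbers are made up; this is a schematic of the contrast, not Qwen's implementation.

```python
import torch

def token_level_ratios(logp_new, logp_old):
    """GRPO-style: one importance ratio per token; over long sequences the
    variance of these ratios compounds, which the Qwen team identifies as a
    source of instability."""
    return torch.exp(logp_new - logp_old)

def sequence_level_ratio(logp_new, logp_old):
    """GSPO-style: a single, length-normalized sequence ratio (the geometric
    mean of the per-token ratios), applied uniformly to the whole response."""
    return torch.exp((logp_new - logp_old).mean())

# Toy usage: per-token log-probs of one 5-token response under the new and
# old policies (made-up numbers).
logp_old = torch.tensor([-2.1, -0.9, -3.0, -1.2, -0.4])
logp_new = torch.tensor([-1.8, -1.1, -2.5, -1.4, -0.3])
print(token_level_ratios(logp_new, logp_old))
print(sequence_level_ratio(logp_new, logp_old))
```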
When a prompt optimizer learns to evolve, it can actually outperform reinforcement learning
机器之心· 2025-07-31 08:58
Core Viewpoint
- The article discusses the introduction of GEPA (Genetic-Pareto), a new optimization technique that outperforms the GRPO reinforcement learning algorithm by 20% while significantly reducing the number of rollouts to 1/35 of the original [2][39].

Group 1: GEPA Overview
- GEPA employs a technique called reflective prompt evolution, which enhances the performance of composite AI systems [2][6].
- The core principles of GEPA include genetic prompt evolution, utilizing natural language feedback, and Pareto-based candidate selection [7][8].

Group 2: GEPA Algorithm
- GEPA initializes a candidate pool with parameters from the composite AI system and iteratively proposes new candidates until the evaluation budget is exhausted [12][15].
- The optimization process involves mutation or crossover of existing candidates, allowing GEPA to accumulate learning signals and improve candidate performance over iterations [16][17].

Group 3: Reflective Feedback Mechanism
- Natural language trajectories generated during the execution of the composite AI system provide insights into the reasoning steps, enabling diagnostic value for decision-making [19][20].
- GEPA utilizes these trajectories for implicit credit assignment, allowing targeted updates to modules based on their performance [21][22].

Group 4: Candidate Selection Strategy
- GEPA employs a Pareto-based candidate selection strategy to avoid local optima and ensure a balance between exploration and exploitation (a toy sketch of the Pareto filter follows this summary) [27][30].
- This strategy involves identifying candidates that have achieved the best scores across training tasks, filtering out strictly dominated candidates [31][32].

Group 5: Performance Evaluation
- Experimental results show that GEPA consistently outperforms MIPROv2 and GRPO across various benchmarks, achieving improvements of up to 14.29% [42][39].
- GEPA demonstrates high sample efficiency, outperforming GRPO while requiring significantly fewer rollouts [39][41].

Group 6: Observations and Insights
- The next candidate selection strategy significantly impacts optimization trajectories and final performance, with Pareto-based sampling showing clear advantages [43].
- Optimized prompts from GEPA are shorter and more efficient than few-shot demonstration prompts, enhancing computational efficiency [45].
- A unique system-aware crossover strategy, GEPA+Merge, yields additional performance gains by identifying complementary strategies from different optimization lineages [47].
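The Pareto-based selection step described above reduces to a standard dominance filter: a candidate prompt survives if no other candidate is at least as good on every training task and strictly better on at least one. The sketch below shows that filter on made-up scores and candidate names; GEPA's actual sampling weights among the surviving candidates are omitted.

```python
def pareto_candidates(scores):
    """Sketch of GEPA's Pareto-based candidate selection as described above.
    `scores[c]` is candidate c's list of scores on the training tasks.
    A candidate survives if it is not strictly dominated by another candidate."""
    def dominated(a, b):
        # True if candidate with scores `a` is dominated by candidate with `b`.
        return all(bj >= aj for aj, bj in zip(a, b)) and any(bj > aj for aj, bj in zip(a, b))
    return [
        c for c, sc in scores.items()
        if not any(dominated(sc, other) for o, other in scores.items() if o != c)
    ]

# Toy usage: three candidate prompts evaluated on three tasks.
scores = {
    "cand_A": [0.9, 0.2, 0.5],
    "cand_B": [0.6, 0.8, 0.5],
    "cand_C": [0.5, 0.2, 0.4],   # dominated by both A and B
}
print(pareto_candidates(scores))   # -> ['cand_A', 'cand_B']
```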