Challenging GRPO: NVIDIA Proposes GDPO for Multi-Reward Optimization
具身智能之心· 2026-01-13 00:54
Core Viewpoint
- The article discusses the limitations of the GRPO algorithm in multi-reward optimization scenarios and introduces a new method, GDPO, which improves the accuracy and stability of reinforcement learning training by decoupling the normalization of rewards [2][4][17].

Summary by Sections

GRPO Limitations
- GRPO is primarily used to optimize a single-objective reward focused on accuracy. As model capabilities improve, however, training increasingly optimizes multiple rewards, such as response length and format quality, to align better with human preferences [10][12].
- Applying GRPO in multi-reward scenarios normalizes different reward combinations into the same advantage value, which weakens the training signal and flattens distinctions between reward levels [4][10].

Introduction of GDPO
- GDPO (Group Reward-Decoupled Normalization Policy Optimization) addresses these issues by normalizing each reward signal separately before aggregation, preserving the relative differences between rewards and improving training stability [17][18].
- This yields a more accurate representation of the advantages associated with different reward combinations, improving performance in multi-reward reinforcement learning [18].

Experimental Results
- Across tool invocation, mathematical reasoning, and code reasoning tasks, GDPO consistently outperformed GRPO in accuracy and stability. For instance, GDPO improved the accuracy of the Qwen2.5-Instruct-1.5B model by nearly 5% over GRPO [25][26].
- GDPO showed better convergence in its training curves, particularly in tasks requiring both format compliance and accuracy, whereas GRPO exhibited instability and a decline in performance after a certain number of training steps [19][28].
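The article gives no formulas, but the contrast it describes can be sketched in NumPy. The sketch below assumes GRPO's usual group-relative advantage (sum the reward channels per rollout, then z-score over the group) and models GDPO's decoupled normalization as z-scoring each channel over the group before aggregating; the exact aggregation in the paper may differ.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style: sum all reward channels per rollout, then z-score over the group."""
    total = rewards.sum(axis=1)                  # (G,) one scalar reward per rollout
    return (total - total.mean()) / (total.std() + eps)

def gdpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GDPO-style sketch: z-score each reward channel over the group, then aggregate."""
    mu = rewards.mean(axis=0, keepdims=True)     # per-channel mean, shape (1, K)
    sd = rewards.std(axis=0, keepdims=True)      # per-channel std,  shape (1, K)
    return ((rewards - mu) / (sd + eps)).sum(axis=1)

# Four rollouts scored on two channels (accuracy, format).
# Rollouts 0 and 1 earn the same total reward from different combinations.
R = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.5],
              [1.0, 1.0]])

a_grpo = grpo_advantages(R)   # rollouts 0 and 1 collapse to the same advantage
a_gdpo = gdpo_advantages(R)   # rollouts 0 and 1 remain distinct
```

Because the format channel varies less across this group than the accuracy channel, per-channel normalization keeps the two reward combinations apart; that is exactly the group-level information which summing before normalization discards.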
Performance Metrics
- In the tool invocation task, GDPO reached higher format-compliance and accuracy rewards, with a correct-format ratio of over 80% versus GRPO's 76% [26].
- In mathematical reasoning tasks, GDPO maintained and improved accuracy rewards, while GRPO's performance declined after the initial training steps [29][30].

Conclusion
- Overall, GDPO is a more effective method for multi-reward optimization in reinforcement learning, delivering better training stability and performance than GRPO across a range of tasks [37].
Challenging GRPO: NVIDIA Proposes GDPO for Multi-Reward Optimization
机器之心· 2026-01-11 04:00
Core Insights
- The article discusses the evolution of user expectations for language models: models must not only provide correct answers but also exhibit diverse behaviors aligned with human preferences, which has led to multiple reward signals being introduced into reinforcement learning training [1][9].

Group 1: Issues with GRPO
- GRPO, a widely adopted reinforcement learning algorithm, is suboptimal for multi-reward optimization: it normalizes different reward combinations to the same advantage value, which weakens the training signal and flattens distinctions between reward levels [2][3].
- The authors highlight a fundamental limitation of GRPO: it compresses rich group-level reward signals, causing information loss in advantage estimation. This can lead to training instability and performance degradation over time [11][12].

Group 2: Introduction of GDPO
- To address these limitations, the authors propose Group Reward-Decoupled Normalization Policy Optimization (GDPO), which normalizes each reward signal separately before aggregation, preserving the relative differences between rewards and enhancing training stability [16][17].
- GDPO empirically produces more distinct advantage groups than GRPO, leading to more accurate advantage estimates and better training outcomes across a range of reinforcement learning settings [17][18].

Group 3: Experimental Results
- In experiments, GDPO consistently outperformed GRPO on tasks such as tool invocation and mathematical reasoning, with higher accuracy and better convergence. For instance, GDPO improved overall accuracy by approximately 2.7% and format compliance by over 4% in specific training scenarios [24][25].
- The results indicate that GDPO not only improves performance metrics but also mitigates the training collapse observed with GRPO, maintaining model performance throughout training [28][29].

Group 4: Performance in Code Reasoning
- In code reasoning tasks, GDPO achieved higher pass rates while keeping the rate of excessive outputs roughly unchanged; for example, it improved pass rates by 2.6% on CodeContests with only a slight increase in excessive outputs [34][36].
- The findings suggest that GDPO effectively balances multiple objectives, achieving better overall performance than GRPO in both dual- and triple-reward configurations [36].
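The dual- and triple-reward configurations mentioned above imply per-channel reward functions that are scored independently before any normalization. A minimal sketch of a hypothetical triple-reward setup (accuracy, format compliance, length budget); the function names, the `<answer>` tag format, and the token budget are illustrative assumptions, not taken from the paper:

```python
import re

def accuracy_reward(response: str, gold: str) -> float:
    # 1.0 if the extracted final answer matches the reference, else 0.0
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def format_reward(response: str) -> float:
    # reward responses that wrap the final answer in <answer> tags (hypothetical format)
    return 1.0 if re.search(r"<answer>.*?</answer>", response, re.S) else 0.0

def length_reward(response: str, budget: int = 256) -> float:
    # penalize over-long outputs; whitespace tokens stand in for a real tokenizer
    return 1.0 if len(response.split()) <= budget else 0.0

rollout = "Let me reason step by step... <answer>42</answer>"
channels = [accuracy_reward(rollout, "42"),
            format_reward(rollout),
            length_reward(rollout)]   # one score per reward channel
```

Under GDPO, each of these channels would be normalized over the rollout group separately before the advantages are aggregated; under GRPO, they would first be summed into a single scalar and normalized together.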