Core Insights
- The article discusses the evolution of user expectations for language models, emphasizing that models must not only produce correct answers but also exhibit diverse behaviors aligned with human preferences. This has led to the use of multiple reward signals in reinforcement learning training [1][9].

Group 1: Issues with GRPO
- GRPO, a widely adopted reinforcement learning algorithm, has been identified as suboptimal for multi-reward optimization: it can normalize different reward combinations to the same advantage value, which weakens the training signal and erases distinctions between reward levels [2][3].
- The authors highlight a fundamental limitation of GRPO: it compresses rich group-level reward signals, causing information loss in advantage estimation. This can lead to training instability and performance degradation over time [11][12].

Group 2: Introduction of GDPO
- To address these limitations, the authors propose a new strategy called Group Reward-Decoupled Normalization Policy Optimization (GDPO). This method normalizes each reward signal separately before aggregation, preserving the relative differences between rewards and enhancing training stability [16][17].
- GDPO is empirically shown to produce more distinct advantage groups than GRPO, yielding more accurate advantage estimates and better training outcomes across various reinforcement learning settings [17][18].

Group 3: Experimental Results
- In experiments, GDPO consistently outperformed GRPO on tasks such as tool invocation and mathematical reasoning, achieving higher accuracy and faster convergence. For instance, GDPO improved overall accuracy by approximately 2.7% and format compliance by over 4% in specific training scenarios [24][25].
- The results indicate that GDPO not only improves performance metrics but also mitigates the training collapse observed with GRPO, demonstrating its robustness in maintaining model performance throughout training [28][29].

Group 4: Performance in Code Reasoning
- In code reasoning tasks, GDPO achieved higher pass rates while keeping the rate of excessive outputs roughly unchanged. For example, GDPO improved pass rates by 2.6% on CodeContests while only slightly increasing the rate of excessive outputs [34][36].
- The findings suggest that GDPO effectively balances multiple objectives, outperforming GRPO in both dual- and triple-reward configurations [36].
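The contrast between the two normalization schemes can be sketched in a few lines of NumPy. This is a minimal illustration based only on the article's description (GRPO normalizes the combined reward within a group; GDPO normalizes each reward signal separately before aggregation), not the authors' actual implementation; function names, the epsilon term, and the toy reward values are assumptions for illustration.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style sketch: sum the reward components per rollout, then
    normalize the summed reward within the group (group mean/std)."""
    total = rewards.sum(axis=1)                       # (G,) combined reward
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(rewards):
    """GDPO-style sketch (per the article): normalize each reward signal
    separately within the group, then aggregate the normalized values,
    preserving relative differences between reward signals."""
    mu = rewards.mean(axis=0, keepdims=True)          # per-reward group mean
    sigma = rewards.std(axis=0, keepdims=True)        # per-reward group std
    per_reward_adv = (rewards - mu) / (sigma + 1e-8)
    return per_reward_adv.sum(axis=1)                 # aggregate afterwards

# Toy group of 4 rollouts with two reward signals on different scales,
# e.g. correctness in {0, 1} and a small format reward in {0, 0.1}.
R = np.array([[1.0, 0.0],
              [1.0, 0.1],
              [0.0, 0.0],
              [0.0, 0.1]])

# Under GRPO the large-scale reward dominates the summed signal, so the
# format difference between rollouts 0 and 1 barely moves the advantage;
# under GDPO each signal is normalized to comparable scale first, so the
# format difference is clearly reflected.
print(grpo_advantages(R))
print(gdpo_advantages(R))
```

The aggregation step here is a plain sum of the per-reward normalized values; the paper may use a weighted combination, but the key property shown is that decoupled normalization keeps each reward's relative ordering visible in the final advantage instead of letting one reward's scale swamp the others.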
Challenging GRPO: NVIDIA Proposes GDPO, Specialized for Multi-Reward Optimization