Challenging GRPO: NVIDIA Proposes GDPO for Multi-Reward Optimization

Core Viewpoint
- The article examines the limitations of the GRPO algorithm in multi-reward optimization and introduces a new method, GDPO, which improves training accuracy and stability in reinforcement learning by decoupling the normalization of rewards [2][4][17].

Summary by Sections

GRPO Limitations
- GRPO is designed primarily for single-objective optimization focused on accuracy. As model capabilities improve, training increasingly targets multiple rewards at once, such as response length and format quality, to align better with human preferences [10][12].
- Applied to multi-reward scenarios, GRPO normalizes the summed rewards, so distinct reward combinations can collapse into the same advantage value, which weakens the training signal [4][10].

Introduction of GDPO
- GDPO (Group reward-Decoupled Normalization Policy Optimization) addresses this issue by normalizing each reward signal separately before aggregation, preserving the relative differences between rewards and improving training stability [17][18].
- This yields a more faithful representation of the advantages associated with different reward combinations, leading to better performance in multi-reward reinforcement learning [18].

Experimental Results
- Across tool invocation, mathematical reasoning, and code reasoning tasks, GDPO consistently outperformed GRPO in accuracy and stability. For instance, GDPO improved accuracy by nearly 5% over GRPO on the Qwen2.5-Instruct-1.5B model [25][26].
- GDPO showed better convergence in training curves, particularly on tasks requiring both format compliance and accuracy, whereas GRPO became unstable and degraded after a certain number of training steps [19][28].
Performance Metrics
- In the tool invocation task, GDPO reached higher values for both format-compliance and accuracy rewards, with the correct-format ratio exceeding 80% versus GRPO's 76% [26].
- In mathematical reasoning tasks, GDPO maintained and improved accuracy rewards, while GRPO's performance declined after the initial training steps [29][30].

Conclusion
- Overall, GDPO is a more effective method for multi-reward optimization in reinforcement learning, delivering better training stability and performance than GRPO across a range of tasks [37].
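The decoupling described above can be sketched in a few lines. This is a minimal illustration based on the summary's description, not the paper's actual implementation: the function names are hypothetical, the reward matrix is a toy group of sampled responses, and the paper's exact aggregation of per-reward advantages may differ.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages: sum the reward components per response,
    then normalize the aggregate within the group.
    rewards has shape (G, K): G responses, K reward channels."""
    total = rewards.sum(axis=1)                      # (G,) aggregate reward
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GDPO-style advantages (as described in the summary): normalize
    each reward channel separately within the group, then aggregate
    the per-channel advantages."""
    norm = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return norm.sum(axis=1)

# Toy group of 4 responses with 2 reward channels (e.g. accuracy, format).
# Responses 0 and 1 have the same total reward but different combinations:
# GRPO assigns them identical advantages, while GDPO keeps them distinct.
R = np.array([[1.0, 0.0],   # accurate, wrong format
              [0.0, 1.0],   # inaccurate, correct format
              [1.0, 1.0],
              [1.0, 0.0]])
print(grpo_advantages(R))   # entries 0 and 1 are equal
print(gdpo_advantages(R))   # entries 0 and 1 differ
```

The toy example reproduces the failure mode the article attributes to GRPO: once distinct reward combinations are summed before normalization, the per-channel information is lost, whereas per-channel normalization preserves it.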
