Core Insights

- The article discusses the development of a new reinforcement learning algorithm called CE-GPPO, which aims to balance exploration and exploitation when training large language models [3][11][21]
- The Klear team from Kuaishou Technology has made significant advances in AI, particularly in language models, achieving state-of-the-art results on mathematical and coding benchmarks [2][21]

Research Motivation

- The core challenge in optimizing large models for complex reasoning tasks with reinforcement learning is balancing policy entropy, which represents the uncertainty in action selection [6][21]
- Existing methods face instability due to entropy collapse and entropy explosion, leading to either too little or excessive exploration [6][21]

Algorithm Design

- CE-GPPO introduces a new approach to gradient clipping, retaining and scaling the gradients of low-probability tokens to maintain a balance between exploration and convergence [11][15]
- The algorithm employs two adjustable hyperparameters, β₁ and β₂, to control the gradient weights of different token types, enabling a flexible trade-off between exploration and exploitation (see the sketch after this summary) [15][24]

Experimental Results

- CE-GPPO was tested on multiple mathematical reasoning benchmarks, showing superior performance compared to other methods, particularly on high-difficulty tasks [20][21]
- The results indicate that larger model sizes benefit more from CE-GPPO, demonstrating its scalability potential [21][24]

Comparison with Other Algorithms

- CE-GPPO outperformed other recent reinforcement learning algorithms such as CISPO and GSPO, showcasing its effectiveness in maintaining training stability and performance [35][36]
- The method also demonstrated advantages over traditional entropy regularization techniques, maintaining a stable entropy curve throughout training [37]
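To make the clipping mechanism concrete, here is a minimal PyTorch sketch of a CE-GPPO-style policy loss written from the description above. The function name, signature, default values, and the exact way β₁ and β₂ enter the loss are illustrative assumptions, not the Klear team's reference implementation; the point is only that tokens standard PPO clipping would silence still contribute a bounded, scaled gradient.

```python
"""
Hypothetical sketch of a CE-GPPO-style loss; names and the exact formula
for the beta1/beta2 terms are assumptions based on the summary above.
"""
import torch


def ce_gppo_policy_loss(
    logp: torch.Tensor,       # log-probs of sampled tokens under the current policy
    logp_old: torch.Tensor,   # log-probs under the behavior (old) policy
    adv: torch.Tensor,        # per-token advantages
    eps_low: float = 0.2,     # lower clip offset, i.e. bound 1 - eps_low
    eps_high: float = 0.2,    # upper clip offset, i.e. bound 1 + eps_high
    beta1: float = 0.1,       # gradient weight for tokens clipped at the lower bound
    beta2: float = 0.1,       # gradient weight for tokens clipped at the upper bound
) -> torch.Tensor:
    ratio = torch.exp(logp - logp_old.detach())
    lo, hi = 1.0 - eps_low, 1.0 + eps_high

    # Standard PPO clipped surrogate. Tokens pushed past a clip boundary
    # (ratio > hi with adv > 0, or ratio < lo with adv < 0) contribute zero gradient.
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, lo, hi) * adv)

    # Masks for the tokens whose gradients vanilla clipping would discard.
    clipped_high = ((ratio > hi) & (adv > 0)).float()
    clipped_low = ((ratio < lo) & (adv < 0)).float()

    # Gradient-preserving terms: re-attach a scaled gradient for clipped tokens
    # without changing the forward value, via the stop-gradient trick
    # (ratio - ratio.detach() is 0 in the forward pass but keeps d ratio in backward).
    grad_term = (ratio - ratio.detach()) * adv
    preserve = beta1 * clipped_low * grad_term + beta2 * clipped_high * grad_term

    # Negate because optimizers minimize; average over tokens.
    return -(surrogate + preserve).mean()
```

The `ratio - ratio.detach()` pattern keeps the loss value identical to vanilla clipped PPO, so β₁ and β₂ only reshape the backward pass: setting both to zero recovers standard clipping, while larger values let out-of-range (often low-probability) tokens keep more of their gradient signal, which is the exploration-versus-convergence knob the summary describes.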
Kuaishou's Klear team proposes CE-GPPO: coordinating entropy via gradient preservation to address entropy instability in reinforcement learning