The GPPO Algorithm

Kuaishou's Klear-Reasoner Tops the 8B Model Leaderboard, with the GPPO Algorithm Reinforcing Both Training Stability and Exploration!
AI前线 · 2025-08-22 06:07
Core Viewpoint
- Competition among large language models has highlighted the importance of mathematical and coding reasoning capabilities. Kuaishou's Klear team has introduced the Klear-Reasoner model, which achieves state-of-the-art performance across a range of benchmarks [1][2].

Group 1: Model Performance
- Klear-Reasoner outperforms other strong open-source models on benchmarks such as AIME2024 and AIME2025, scoring 90.5% and 83.2% respectively, making it the top-performing 8B model [2].
- This performance is attributed to the innovative GPPO (Gradient-Preserving Clipping Policy Optimization) algorithm, which enhances exploration while maintaining training stability [5][24].

Group 2: Technical Innovations
- GPPO retains all gradients during training, in contrast to traditional clipping methods, which can hinder exploration and slow convergence [8][10].
- By allowing clipped high-entropy tokens to still participate in backpropagation, GPPO preserves exploration ability and accelerates error correction [10].

Group 3: Training Methodology
- During the supervised fine-tuning (SFT) phase, the Klear team emphasizes data quality over quantity, demonstrating that high-quality data sources yield better training efficiency and outcomes [12].
- For high-difficulty tasks, retaining some erroneous samples can improve model performance by providing additional exploration opportunities [16].
- In the reinforcement learning (RL) phase, soft rewards based on test-case pass rates are more effective than hard rewards, leading to improved training stability and efficiency [19].
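The gradient-preserving clipping idea described above can be sketched per token as follows. This is a minimal illustration under stated assumptions, not the Klear team's implementation: standard PPO clipping zeroes the gradient once a token's importance ratio leaves the clip range, while a gradient-preserving variant caps the forward value at the clip boundary but routes the gradient through the unclipped ratio (a stop-gradient trick), so clipped tokens still contribute to backpropagation. The function names and scalar setup are hypothetical.

```python
def ppo_token_grad(ratio, adv, eps=0.2):
    """Per-token gradient of the standard PPO clipped surrogate w.r.t. the
    new log-prob. Objective: min(r*A, clip(r, 1-eps, 1+eps)*A), with
    r = exp(logp_new - logp_old), so dr/dlogp_new = r."""
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    if clipped * adv < ratio * adv:  # min() selects the clipped branch:
        return 0.0                   # the gradient is exactly zero
    return adv * ratio

def gppo_token_grad(ratio, adv, eps_low=0.2, eps_high=0.2):
    """Gradient-preserving variant (sketch of GPPO's idea): the forward
    value is still capped at the clip boundary, but the gradient flows
    through the unclipped ratio via a stop-gradient, so it never vanishes.
    d/dlogp of r * sg(clipped/r) * A  =  A * (clipped/r) * r  =  A * clipped."""
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return adv * clipped
```

For a token with advantage +1.0 whose ratio has drifted to 1.5, standard PPO contributes zero gradient (the token stops learning), while the gradient-preserving form still contributes a bounded, nonzero update, which is how clipped high-entropy tokens keep participating in backpropagation.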
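The hard- versus soft-reward distinction from the RL phase above can be sketched as follows. This is a minimal illustration with hypothetical function names; the source states only that soft rewards are "based on test case pass rates," not the exact formula.

```python
def hard_reward(passed: int, total: int) -> float:
    """All-or-nothing: credit only when every test case passes (sparse signal)."""
    return 1.0 if passed == total else 0.0

def soft_reward(passed: int, total: int) -> float:
    """Fraction of test cases passed: partial credit gives a denser,
    smoother learning signal, which the article reports improves RL
    training stability and efficiency."""
    return passed / total
```

For example, a generated solution that passes 7 of 10 test cases earns 0.0 under the hard scheme but 0.7 under the soft scheme, so partially correct attempts still shape the policy instead of being discarded.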
Group 4: Future Implications
- The release of Klear-Reasoner not only showcases impressive performance but also offers a reproducible, scalable recipe for training reasoning models with supervised and reinforcement learning, providing valuable insights for future applications in mathematics, coding, and other RLVR tasks [24].