GVPO
NeurIPS 25 | An Upgraded GRPO Arrives: GVPO Reshapes the Post-Training Paradigm for Large Models
机器之心 · 2025-10-14 02:06
Core Viewpoint
- Post-training of large models is becoming a key stage of AI development, focused on strengthening reasoning capabilities, aligning with human preferences, and maintaining stability and efficiency [1].

Summary by Sections

GVPO Introduction
- A team from Zuoyebang and the Hong Kong University of Science and Technology proposed GVPO (Group Variance Policy Optimization), a new method that addresses the training instability of GRPO (Group Relative Policy Optimization) [2].

Design Motivation
- Inspired by DPO (Direct Preference Optimization), the team derives the reward-maximizing policy under a KL constraint in the GRPO setting, where multiple responses are sampled for each prompt [5].

Practical Challenges
- The main obstacle is the partition term Z(x), an expectation over all possible responses to a prompt, which is intractable to compute. The team showed that if the gradient weights of all samples under the same prompt sum to zero, the Z(x) terms cancel out, sidestepping this difficulty [6].

Key Advantages of GVPO
1. **Unique Optimal Solution Guarantee**: GVPO's MSE form admits a strict mathematical proof that the loss attains its unique optimum exactly when the implicit reward R_θ equals the actual reward R, ensuring the algorithm's effectiveness and stability [13].
2. **No Need for Importance Sampling**: GVPO's optimality places only mild conditions on the sampling distribution, enabling off-policy training without the instability commonly introduced by importance-sampling corrections [14].

Analytical Perspectives
- GVPO can be understood from three complementary perspectives, each corresponding to an equivalent form of the loss (a hedged sketch of the MSE and weighted-NLL forms follows this summary):
  1. **Negative Log-Likelihood (NLL) Perspective**: the GVPO loss can be read as a weighted negative log-likelihood, which allows flexible reuse of historical and heterogeneous data sources [17].
  2. **Mean Squared Error (MSE) Perspective**: the objective minimizes the deviation between implicit and actual rewards, guaranteeing convergence to the unique global optimum under the KL constraint [18].
  3. **Reinforcement Learning (RL) Perspective**: this view exposes the three components of the GVPO loss and the balance it strikes between actual and predicted rewards [19].

Experimental Results
- On mathematical reasoning tasks, GVPO outperformed GRPO and its improved variant Dr.GRPO across five benchmarks, substantially lifting the base model's performance [21].
- Ablation studies show that GVPO is insensitive to the hyperparameter β and scales well as the number of samples per prompt increases, allowing smaller models to match larger ones [23].

Significance and Future Prospects
- GVPO marks a shift in post-training from experience-driven recipes toward methods with theoretical guarantees, improving the stability, flexibility, and efficiency of large-model training [25][26].
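The article describes the GVPO loss only at a high level: an MSE between group-centered implicit rewards and group-centered actual rewards, equivalently a weighted negative log-likelihood whose per-group weights sum to zero so that Z(x) cancels. The PyTorch sketch below is a reconstruction under those descriptions, not the paper's exact formulation; the function names (`gvpo_mse_loss`, `gvpo_nll_loss`), the use of a frozen reference policy in the log-ratio, and the [G, K] tensor layout (G prompts, K sampled responses each, sequence log-probs summed per response) are assumptions.

```python
import torch


def gvpo_mse_loss(policy_logps, ref_logps, rewards, beta=0.1):
    """MSE reading: regress group-centered implicit rewards onto
    group-centered actual rewards. All inputs are assumed [G, K]."""
    # Implicit reward beta*log(pi_theta / pi_ref), up to the intractable
    # log Z(x) term; centering within each prompt's group of K samples
    # cancels that term, per the article's description.
    implicit = beta * (policy_logps - ref_logps)
    implicit_c = implicit - implicit.mean(dim=1, keepdim=True)
    reward_c = rewards - rewards.mean(dim=1, keepdim=True)
    # Minimized exactly when the implicit reward matches the actual reward
    # up to a per-prompt constant (the unique-optimum property cited above).
    return ((implicit_c - reward_c) ** 2).mean()


def gvpo_nll_loss(policy_logps, ref_logps, rewards, beta=0.1):
    """Weighted-NLL reading of the same objective:
    -sum_i w_i * log pi_theta(y_i | x), with detached weights."""
    log_ratio = policy_logps - ref_logps
    adv = rewards - rewards.mean(dim=1, keepdim=True)
    centered_ratio = log_ratio - log_ratio.mean(dim=1, keepdim=True)
    # The weights of the K samples for one prompt sum to (numerically) zero,
    # which is what lets the Z(x) term drop out of the gradient.
    weights = (adv - beta * centered_ratio).detach()
    return -(weights * policy_logps).mean()


if __name__ == "__main__":
    # Toy check with random placeholders: 4 prompts, 8 responses each.
    G, K = 4, 8
    policy_lp = torch.randn(G, K, requires_grad=True)
    ref_lp, R = torch.randn(G, K), torch.randn(G, K)
    loss = gvpo_mse_loss(policy_lp, ref_lp, R)
    loss.backward()
    print(loss.item(), policy_lp.grad.abs().sum().item())
```

For these sketches the two readings agree up to a constant factor of 2β in the gradient, consistent with the article's claim that the NLL, MSE, and RL perspectives are equivalent views of one loss; in a real trainer the per-response log-probs would come from summing per-token log-probabilities of each sampled completion.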