GRPO Training No Longer Fools Itself: Kuaishou Kling (快手可灵) and Sun Yat-sen University Introduce "GRPO-Guard", Significantly Alleviating Over-Optimization in Visual Generation
机器之心·2025-11-13 04:12

Core Insights

- The article introduces GRPO-Guard, a solution designed to mitigate the over-optimization problem observed when GRPO is applied to flow models, preserving fast convergence while significantly reducing the risk of over-optimization [3][35].

Group 1: GRPO and Over-Optimization Issues

- GRPO delivers significant improvements for image and video generation flow models, but its importance-ratio clipping mechanism carries a systematic bias that leads to over-optimization: proxy rewards keep rising while the model's actual performance degrades [2][14].
- Empirical analysis shows that the mean of the importance ratio stays consistently below 1, so clipping fails to effectively constrain overly confident positive gradients, leaving the model suboptimal in real applications [2][14].

Group 2: Introduction of GRPO-Guard

- GRPO-Guard introduces two key improvements: RatioNorm, which normalizes the importance-ratio distribution so that its mean is pulled back toward 1, and Cross-Step Gradient Balancing, which enforces uniform exploration across the noise schedule [19][21]; a minimal sketch of how such a normalized, clipped objective could look is given after this summary.
- Together these changes restore the effectiveness of the clipping mechanism and stabilize policy updates, thereby alleviating the over-optimization phenomenon [35].

Group 3: Experimental Results

- Experiments across multiple GRPO variants and diffusion backbones show that GRPO-Guard significantly alleviates over-optimization while matching or even improving on baseline performance [26][35].
- In the baseline methods the gold score exhibits a clear downward trend over training, whereas GRPO-Guard largely mitigates this decline, indicating improved model robustness [26][28].

Group 4: Future Directions

- GRPO-Guard alleviates but does not completely eliminate over-optimization: a significant gap between proxy scores and gold scores remains [35].
- Future efforts should focus on developing more accurate reward models to further reduce reward hacking and improve optimization outcomes, providing a more reliable technical foundation for GRPO's application in flow models and broader generative tasks [35].
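To make the two mechanisms concrete, below is a minimal, hypothetical PyTorch sketch of how a RatioNorm-style correction and a uniform cross-step weighting could be attached to a standard clipped GRPO surrogate. The function name `grpo_guard_like_loss`, the tensor shapes, and the specific normalization (re-centering the log-ratio across the group at each denoising step) are illustrative assumptions and are not the paper's exact formulation.

```python
# Hypothetical sketch of a GRPO-style clipped objective with a RatioNorm-like
# correction and uniform cross-step weighting. Names, shapes, and the exact
# normalization are assumptions made for illustration only.
import torch


def grpo_guard_like_loss(
    logp_new: torch.Tensor,    # log-prob under current policy,          shape (B, T)
    logp_old: torch.Tensor,    # log-prob under the behavior policy,     shape (B, T)
    advantages: torch.Tensor,  # group-normalized advantages per step,   shape (B, T)
    clip_eps: float = 0.2,
) -> torch.Tensor:
    # Per-step log importance ratio.
    log_ratio = logp_new - logp_old                                    # (B, T)

    # RatioNorm-like step (assumption): re-center the log-ratio across the
    # group at each denoising step so the ratio distribution sits near 1
    # before clipping; otherwise a mean ratio below 1 lets over-confident
    # positive gradients slip past the clip range.
    log_ratio = log_ratio - log_ratio.mean(dim=0, keepdim=True).detach()
    ratio = log_ratio.exp()

    # Standard PPO/GRPO clipped surrogate.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_step = -torch.minimum(unclipped, clipped)                      # (B, T)

    # Cross-step balancing (assumption): weight every denoising step equally
    # so no region of the noise schedule dominates the policy update.
    num_steps = per_step.shape[1]
    step_weights = torch.full((num_steps,), 1.0 / num_steps)
    return (per_step * step_weights).sum(dim=1).mean()
```

The key line is the re-centering of the log-ratio: when the mean importance ratio drifts below 1, the clip window [1 - ε, 1 + ε] no longer binds the positive-advantage terms, which is exactly the failure mode the article attributes to vanilla GRPO on flow models; pulling the mean back toward 1 is one way to restore that constraint.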