GRPO
GRPO training no longer "fools itself": Kuaishou Kling x Sun Yat-sen University release GRPO-Guard, significantly mitigating over-optimization in visual generation
机器之心· 2025-11-13 04:12
Core Insights
- The article introduces GRPO-Guard, a solution designed to mitigate the over-optimization problem observed when GRPO is applied to flow models, ensuring fast convergence while significantly reducing the risk of over-optimization [3][35]

Group 1: GRPO and Over-Optimization Issues
- GRPO has delivered significant improvements in image and video generation flow models, but it suffers from a systematic bias in the importance-ratio clipping mechanism, leading to over-optimization: the model's real performance degrades even as the proxy reward keeps rising [2][14]
- Empirical analysis shows that the mean of the importance ratio sits consistently below 1, so clipping fails to effectively constrain overly confident positive gradients, which degrades model performance in real applications [2][14]

Group 2: Introduction of GRPO-Guard
- GRPO-Guard introduces two key improvements: RatioNorm, which normalizes the importance-ratio distribution so its mean moves back toward 1, and Cross-Step Gradient Balancing, which keeps exploration uniform across the noise schedule (a minimal sketch of the ratio-normalization idea follows this summary) [19][21]
- These enhancements restore the effectiveness of the clipping mechanism and stabilize policy updates, thereby alleviating the over-optimization phenomenon [35]

Group 3: Experimental Results
- Experiments on multiple GRPO variants and diffusion backbone models demonstrate that GRPO-Guard significantly alleviates over-optimization while matching or even improving on baseline performance [26][35]
- In the baseline methods the gold score shows a noticeable downward trend, whereas GRPO-Guard effectively mitigates this decline, indicating improved model robustness [26][28]

Group 4: Future Directions
- GRPO-Guard mitigates over-optimization but does not completely eliminate it; a significant gap between proxy scores and gold scores remains [35]
- Future efforts should focus on developing more accurate reward models to further reduce reward hacking and improve optimization outcomes, providing a more reliable technical foundation for applying GRPO to flow models and broader generative tasks [35]
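The summary above describes RatioNorm as re-centering the importance-ratio distribution so the clipping range works as intended. The sketch below illustrates that idea on top of a standard clipped surrogate; the shift-by-batch-mean normalization, the function names, and the hyperparameters are assumptions for exposition, not the GRPO-Guard authors' implementation.

```python
# Hypothetical sketch of a GRPO-style clipped objective with a RatioNorm-like
# correction. Names (ratio_norm, grpo_guard_loss) are illustrative, and the
# re-centering rule is an assumption rather than the paper's exact method.
import torch

def ratio_norm(log_ratio: torch.Tensor) -> torch.Tensor:
    # Re-center the per-step log importance ratios so the resulting ratios
    # have mean close to 1, restoring the symmetry the clipping range assumes.
    return log_ratio - log_ratio.mean()

def grpo_guard_loss(logp_new, logp_old, advantage, eps=0.2):
    log_ratio = ratio_norm(logp_new - logp_old)   # assumed normalization step
    ratio = torch.exp(log_ratio)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Standard PPO/GRPO pessimistic surrogate: take the worse of the two terms.
    return -torch.min(unclipped, clipped).mean()
```

With the plain objective one would simply use `log_ratio = logp_new - logp_old`; the article's observation is that when that raw ratio's mean drifts below 1, the upper clip at 1 + eps no longer restrains over-confident positive-advantage updates.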
From a contrastive learning perspective, is GRPO just DPO?
自动驾驶之心· 2025-10-18 16:03
Core Insights
- The article recounts the development of an efficient GRPO (Group Relative Policy Optimization) pipeline and its implications for reinforcement learning, highlighting the challenges and breakthroughs encountered along the way [1][2]

Group 1: Research Development
- The initial focus was on making GRPO faster, with an emphasis on sampling efficiency, a common bottleneck in reinforcement learning [2][3]
- The author experimented with tree-based sampling methods but found that they did not yield the expected efficiency gains [3]
- A second approach, "speculative sampling", aimed to exit early once a correct sample was obtained, but implementation difficulties limited its performance [3][4]

Group 2: Methodological Innovations
- A third approach used historical data to estimate each prompt's probability of being answered correctly, yielding a more efficient, Bayesian rollout-allocation strategy (an illustrative sketch follows this summary) [4]
- Experiments showed that reducing the number of rollouts per prompt did not significantly hurt performance, indicating the methodology is robust [4][5]
- Viewing the problem through the lens of contrastive learning led to insights about the relationship between DPO (Direct Preference Optimization) and GRPO, suggesting potential avenues for further research [5]

Group 3: Community and Collaboration
- The article emphasizes the importance of community engagement in advancing research, highlighting the role of discussion and collaboration in refining ideas and methodologies [8][10]
- A comprehensive community focused on large-model technology has been established to facilitate knowledge sharing and collaboration across domains, including academic research and practical applications [9][10]
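The "estimate prompt correctness from history" idea lends itself to a Beta-Bernoulli treatment. The sketch below is one hypothetical way to turn per-prompt pass/fail history into a rollout budget; the prior, the allocation rule, and all names are the editor's assumptions, not the author's actual method.

```python
# Illustrative sketch: allocate rollouts per prompt using a Beta-Bernoulli
# posterior over the prompt's solve rate. Everything here is an assumption
# made for exposition.
from dataclasses import dataclass

@dataclass
class PromptStats:
    successes: int = 0   # rollouts that earned reward 1 so far
    failures: int = 0    # rollouts that earned reward 0 so far

def posterior_mean(stats: PromptStats, alpha: float = 1.0, beta: float = 1.0) -> float:
    # Posterior mean of the prompt's solve rate under a Beta(alpha, beta) prior.
    return (alpha + stats.successes) / (alpha + beta + stats.successes + stats.failures)

def rollouts_to_spend(stats: PromptStats, max_rollouts: int = 16, min_rollouts: int = 4) -> int:
    # GRPO's group-relative advantage is zero when every rollout in a group
    # gets the same reward, so prompts that look almost-surely solved or
    # almost-surely failed carry little learning signal; spend fewer samples there.
    p = posterior_mean(stats)
    informativeness = 4.0 * p * (1.0 - p)          # peaks at p = 0.5
    return max(min_rollouts, round(max_rollouts * informativeness))
```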
How to prepare for RL interview questions?
自动驾驶之心· 2025-09-12 16:03
Core Insights
- The article discusses the GRPO (Group Relative Policy Optimization) framework, categorizing it primarily as on-policy while acknowledging its potential off-policy adaptations [5][6][7]
- It emphasizes the importance of understanding where the training data comes from and what reusing old-policy data implies for the on-policy versus off-policy distinction [10][11]

GRPO Framework
- GRPO is typically considered on-policy because it estimates a group-relative advantage from data generated by the current behavior policy [5][6]
- Recent work has explored off-policy adaptations of GRPO that reuse data from older policies to improve sample efficiency and stability [4][5][7]
- The original implementation relies on current-policy data to estimate gradients and advantages, which matches the traditional on-policy definition [6][10]

Importance Sampling
- Importance Sampling (IS) is the key tool in off-policy evaluation, allowing data collected under a behavior policy to be used to assess the value of a target policy [8][9]
- The article walks through the mathematical formulation of IS and its role in correcting the bias that arises from sampling under a different distribution (the standard estimators are reproduced after this summary) [12][14]
- Weighted Importance Sampling is introduced as a remedy for the high variance of the basic IS estimator [15][16][17]

GSPO and DAPO
- GSPO (Group Sequence Policy Optimization) addresses the high variance and instability of GRPO/PPO on long sequences by moving from token-level to sequence-level importance ratios (see the sketch following this summary) [18][21]
- DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) improves training stability and sample efficiency on long chain-of-thought tasks through a set of engineering techniques [20][24]
- Both GSPO and DAPO aim to make training of large-scale language models more robust, particularly for long sequences and against entropy collapse [20][24][27]

Entropy Collapse
- Entropy collapse is the rapid loss of policy randomness during training, which reduces exploration and can lead to premature, suboptimal convergence [28][30]
- The article surveys mitigation strategies, including entropy regularization, KL penalties, and dynamic sampling (a minimal entropy-bonus sketch also follows this summary) [32][33][34]
- It emphasizes the need to balance exploration and exploitation to maintain healthy training dynamics [37][41]

Relationship Between Reward Hacking and Entropy Collapse
- Reward hacking occurs when an agent finds shortcuts that maximize the measured reward, often driving entropy collapse as the policy becomes overly deterministic [41][42]
- The two problems reinforce each other in a cycle, so addressing one tends to alleviate the other [41][42]
- Strategies for managing both include refining reward functions, enhancing training stability, and ensuring diverse sampling [47][48]
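The article walks through the standard importance-sampling formulation; for reference, the textbook estimators are reproduced here (notation is the editor's, not quoted from the article). For trajectories $\tau_i$ drawn from a behavior policy $\pi_b$, ordinary IS reweights each return by the likelihood ratio under the target policy $\pi$:

$$
\hat{V}_{\text{IS}} = \frac{1}{N}\sum_{i=1}^{N} \rho_i\, R(\tau_i),
\qquad
\rho_i = \prod_{t=0}^{T_i-1} \frac{\pi\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_b\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}.
$$

Weighted (self-normalized) importance sampling replaces the $1/N$ average with normalization by the sum of the weights,

$$
\hat{V}_{\text{WIS}} = \frac{\sum_{i=1}^{N} \rho_i\, R(\tau_i)}{\sum_{i=1}^{N} \rho_i},
$$

trading a small bias for a substantial reduction in variance, which is why it is the standard remedy the summary mentions.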
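GSPO's central change, as summarized above, is moving the importance ratio from the token level to the sequence level. The sketch below contrasts the two, using the commonly cited length-normalized sequence likelihood ratio; treat the exact normalization as an assumption rather than a quote of the GSPO paper.

```python
# Sketch contrasting a token-level (GRPO/PPO style) importance ratio with a
# sequence-level (GSPO style) ratio for one sampled response.
import torch

def token_level_ratios(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    # One importance ratio per token; variance compounds over long sequences.
    return torch.exp(logp_new - logp_old)                 # shape [T]

def sequence_level_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    # A single ratio per sequence, built from the mean token log-ratio
    # (i.e. a length-normalized sequence likelihood ratio), which damps the
    # variance that accumulates over long chains of thought.
    return torch.exp((logp_new - logp_old).mean())        # scalar
```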
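Among the entropy-collapse mitigations listed, entropy regularization is the simplest to show concretely. The following is a minimal sketch of adding an entropy bonus to a policy loss; the coefficient value and the way entropy is reduced over the batch are illustrative choices, not prescriptions from the article.

```python
# Minimal sketch of an entropy bonus added to the policy loss to discourage
# the policy from collapsing onto a few high-probability tokens too early.
import torch
import torch.nn.functional as F

def loss_with_entropy_bonus(policy_loss: torch.Tensor,
                            logits: torch.Tensor,
                            entropy_coef: float = 0.01) -> torch.Tensor:
    # Per-token categorical entropy of the current policy, averaged over the batch.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    # Subtracting the bonus rewards higher entropy, i.e. preserves exploration.
    return policy_loss - entropy_coef * entropy
```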
From RLHF and PPO to GRPO and training reasoning models: the reinforcement learning primer you need
机器之心· 2025-06-22 04:26
Core Insights
- Reinforcement Learning (RL) has become an essential technology in AI, particularly for large language models (LLMs) [1]
- The Unsloth team has released a comprehensive reinforcement learning tutorial covering concepts from RLHF to GRPO, accessible to beginners and advanced users alike [2][3]

Group 1: Understanding Reinforcement Learning
- The goal of reinforcement learning is to increase the likelihood of "good" outcomes while reducing the likelihood of "bad" ones [8][10]
- The key components of RL are the environment, the agent, actions, and the reward function, which together define the learning process [9][14]
- RLHF (Reinforcement Learning from Human Feedback) gained popularity largely through OpenAI's implementation, which trains models to generate outputs that humans judge useful [16][19]

Group 2: GRPO and Its Advantages
- GRPO (Group Relative Policy Optimization) is a method developed to train reasoning models; it differs from PPO (Proximal Policy Optimization) by removing the value model and relying on custom reward functions [22][24]
- GRPO estimates a baseline by sampling multiple outputs for each question and comparing each sample's reward against the group average, which drives the policy update (a worked sketch of the group-relative advantage follows this summary) [27][28]
- The approach yields significant memory savings and extends beyond coding and mathematics to tasks such as email automation and legal applications [30]

Group 3: Training with Unsloth
- Unsloth provides a detailed guide for training reasoning models with GRPO, requiring a minimum of 5GB VRAM for local training of models up to 1.5 billion parameters [44]
- Training generates multiple answer variants for each question, evaluates them with a reward function, and updates the model weights accordingly [45][57]
- Effective training requires a well-designed reward function and sufficient data, with at least 500 lines of data recommended for optimal results [49][50]

Group 4: Reward Functions and Validators
- Reward functions and validators play complementary roles in evaluating model outputs: the reward function assigns scores based on correctness and quality, while the validator verifies that outputs are actually correct [46][56]
- Example reward functions reward correct answers and penalize incorrect or overly verbose responses (a hypothetical example follows this summary) [61]
- Reward-function design is critical, since a poorly constructed reward can inadvertently degrade model performance [57]
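The group-relative baseline mentioned in Group 2 is easy to make concrete: each sampled answer's reward is compared to the statistics of its own group. The sketch below shows one common formulation (mean-centering with standard-deviation scaling); some GRPO variants omit the scaling, so treat the exact form as illustrative.

```python
# Sketch of GRPO's group-relative advantage for the samples of one question.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape [num_samples_for_one_question]
    baseline = rewards.mean()
    scale = rewards.std() + eps          # avoids division by zero when all rewards match
    return (rewards - baseline) / scale

# Example: four sampled answers to one question, scored by a reward function.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])
print(group_relative_advantages(rewards))
```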
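To make Group 4 concrete, here is a hypothetical reward function in the spirit the tutorial describes, rewarding a correct final answer and penalizing incorrect or overly verbose responses. The signature and scoring values are the editor's assumptions, not Unsloth's actual API.

```python
# Hypothetical reward function: reward correctness, lightly penalize wrong or
# overly long completions. Scores and thresholds are illustrative.
def reward_answer(completion: str, gold_answer: str, max_chars: int = 2000) -> float:
    score = 0.0
    # Reward the presence of the expected final answer.
    if gold_answer.strip() in completion:
        score += 2.0
    else:
        score -= 1.0
    # Penalize overly verbose responses.
    if len(completion) > max_chars:
        score -= 0.5
    return score

# Example usage on a single sampled completion.
print(reward_answer("... therefore the answer is 42", gold_answer="42"))
```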