Importance Sampling
How to prepare for RL-related interview questions?
自动驾驶之心· 2025-09-12 16:03
Core Insights
- The article discusses the GRPO (Group Relative Policy Optimization) framework, primarily categorizing it as on-policy while acknowledging its potential off-policy adaptations [5][6][7]
- It emphasizes the importance of understanding data sources and the implications of using old-policy data in the context of on-policy versus off-policy learning [10][11]

GRPO Framework
- GRPO is typically considered on-policy because it estimates the group-relative advantage from data generated by the current behavior policy [5][6]
- Recent works have explored off-policy adaptations of GRPO, utilizing data from older policies to improve sample efficiency and stability [4][5][7]
- The original implementation of GRPO relies on current-policy data to estimate gradients and advantages, aligning with the traditional on-policy definition [6][10]

Importance Sampling
- Importance Sampling (IS) is a key method in off-policy evaluation, allowing data collected under a behavior policy to be used to assess the value of a target policy [8][9]
- The article outlines the mathematical formulation of IS, highlighting its role in correcting the bias that arises from the mismatch between sampling distributions [12][14]
- Weighted Importance Sampling is introduced as a remedy for the high-variance problem of basic IS [15][16][17] (a sketch of both estimators follows this summary)

GSPO and DAPO
- GSPO (Group Sequence Policy Optimization) addresses the high variance and instability of GRPO/PPO by shifting to sequence-level importance ratios [18][21]
- DAPO (Decoupled Clip & Dynamic Sampling Policy Optimization) improves training stability and sample efficiency on long chain-of-thought tasks through a set of engineering techniques [20][24]
- Both GSPO and DAPO aim to make the training of large-scale language models more robust, particularly in handling long sequences and mitigating entropy collapse [20][24][27]

Entropy Collapse
- Entropy collapse refers to a rapid decrease in policy randomness during training, which reduces exploration and can lead to suboptimal convergence [28][30]
- The article discusses strategies to mitigate entropy collapse, including entropy regularization, KL penalties, and dynamic sampling [32][33][34] (see the advantage-and-entropy sketch after this summary)
- It emphasizes the need to balance exploration and exploitation to maintain effective training dynamics [37][41]

Relationship Between Reward Hacking and Entropy Collapse
- Reward hacking occurs when an agent finds shortcuts to maximize reward, which often drives entropy collapse as the policy becomes overly deterministic [41][42]
- The article describes the cyclical relationship between reward hacking and entropy collapse, suggesting that addressing one helps mitigate the other [41][42]
- Strategies for managing both issues include refining reward functions, improving training stability, and ensuring diverse sampling [47][48]
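Since the summary only references the IS formulas, here is a minimal NumPy sketch of ordinary versus weighted importance sampling for off-policy value estimation. The discrete behavior/target policies, the reward values, and the sample size are illustrative assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete policies over 3 actions (assumed for illustration).
behavior = np.array([0.5, 0.3, 0.2])   # pi_b: policy that generated the data
target   = np.array([0.2, 0.3, 0.5])   # pi:   policy we want to evaluate

# Simulated rewards: action 2 is best, but the behavior policy rarely picks it.
true_reward = np.array([0.1, 0.5, 1.0])

n = 10_000
actions = rng.choice(3, size=n, p=behavior)
rewards = true_reward[actions] + rng.normal(0, 0.1, size=n)

# Importance ratios rho = pi(a) / pi_b(a) correct for the sampling mismatch.
rho = target[actions] / behavior[actions]

# Ordinary IS: unbiased but high variance.
ois = np.mean(rho * rewards)

# Weighted IS: normalizes by the sum of ratios; slightly biased, much lower variance.
wis = np.sum(rho * rewards) / np.sum(rho)

print(f"true value under target policy: {true_reward @ target:.3f}")
print(f"ordinary IS estimate:           {ois:.3f}")
print(f"weighted IS estimate:           {wis:.3f}")
```

Weighted IS trades a small bias for substantially lower variance, which is the trade-off the summary cites as the motivation for introducing it [15][16][17].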
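The summary also mentions GRPO's group-relative advantage and entropy regularization as one mitigation for entropy collapse. The sketch below shows, under assumed tensor shapes and an assumed entropy coefficient, how a group-normalized advantage and an entropy bonus might be computed; it omits the clipped policy-ratio term of the full objective and is not the original GRPO implementation.

```python
import torch

def group_relative_advantage(rewards: torch.Tensor) -> torch.Tensor:
    """Standardize rewards within each group of G rollouts for the same prompt.

    rewards: (num_prompts, G) scalar rewards, one per sampled completion.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def entropy_bonus(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token policy entropy; adding it to the objective
    (or subtracting it from the loss) discourages entropy collapse."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1).mean()

# Toy shapes: 2 prompts x 4 rollouts, sequence length 8, vocab of 16 (assumed).
rewards = torch.randn(2, 4)
logits = torch.randn(2, 4, 8, 16)

adv = group_relative_advantage(rewards)   # (2, 4)
ent = entropy_bonus(logits)               # scalar

beta = 0.01                               # entropy coefficient (assumed)
print("group-relative advantages:", adv)
print("entropy bonus term:", (beta * ent).item())
```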
Does DeepSeek's GRPO cause model collapse? A look at Qwen3's new paradigm GSPO
机器之心· 2025-08-07 09:42
Core Viewpoint
- The article discusses the evolution of reinforcement learning techniques in the post-training phase of large language models (LLMs), highlighting the introduction of Group Sequence Policy Optimization (GSPO) as a solution to the instability issues associated with Group Relative Policy Optimization (GRPO) [2][10][31].

Group 1: Training Phases and Techniques
- The training of large language models typically consists of two phases, pre-training and post-training, where the latter focuses on improving the model's understanding and execution of human instructions [1].
- The post-training phase employs reinforcement learning; early methods such as Reinforcement Learning from Human Feedback (RLHF) were time-consuming and costly due to their reliance on human annotators [2][3].

Group 2: Innovations and Comparisons
- DeepSeek introduced an automated approach to RLHF, significantly reducing costs and improving efficiency by letting the model learn from reward signals rather than manual evaluations [2].
- The DeepSeek team proposed the Group Relative Policy Optimization (GRPO) algorithm, which they argue is more effective than the Proximal Policy Optimization (PPO) used by OpenAI in ChatGPT [3][5].

Group 3: Issues with GRPO
- The Qwen team identified serious stability issues with GRPO, particularly its reliance on token-level importance sampling, which can lead to high variance and training instability [10][11][12].
- In their view, the instability stems from misapplying importance-sampling weights at the token level, where high variance accumulates over long sequences and exacerbates the training challenges [15][16][17].

Group 4: Introduction of GSPO
- To address the issues with GRPO, the Qwen team proposed Group Sequence Policy Optimization (GSPO), which uses sequence-level importance sampling to improve training stability [10][22][31]; a minimal sketch contrasting the two ratio definitions follows this summary.
- GSPO's design mitigates the variance accumulation seen in token-level sampling, leading to better training efficiency and stability [23][24].

Group 5: Experimental Evidence and Advantages
- Experimental results showed that GSPO outperformed GRPO across a range of tasks, with better scalability and training efficiency [20][30].
- The Qwen team highlighted that GSPO simplifies the training of Mixture-of-Experts (MoE) models by eliminating auxiliary strategies such as Routing Replay, which GRPO required for stable convergence [25][27][30].
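To make the token-level versus sequence-level distinction concrete, here is a minimal PyTorch sketch contrasting the two importance-ratio definitions. The log-probability tensors are random placeholders, and the length-normalized sequence ratio follows the general form described for GSPO; treat it as an illustrative assumption rather than the Qwen implementation.

```python
import torch

torch.manual_seed(0)
batch, seq_len = 4, 32

# Placeholder per-token log-probabilities for a batch of sampled responses,
# under the current policy and the old (sampling) policy.
logp_new = torch.randn(batch, seq_len) * 0.1 - 2.0
logp_old = torch.randn(batch, seq_len) * 0.1 - 2.0

# GRPO/PPO style: one importance ratio per token. Over a long sequence the
# per-token noise accumulates, which is the variance problem the Qwen team
# points to.
token_ratios = (logp_new - logp_old).exp()              # (batch, seq_len)

# GSPO style: a single sequence-level ratio, length-normalized so that long
# responses do not blow up the exponent.
seq_log_ratio = (logp_new - logp_old).sum(dim=-1) / seq_len
seq_ratios = seq_log_ratio.exp()                        # (batch,)

print("token-level ratio std   :", token_ratios.std().item())
print("sequence-level ratio std:", seq_ratios.std().item())
```

Clipping is then applied once per sequence rather than per token, which is how GSPO avoids the accumulated variance described above.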