Importance Sampling

How to prepare for RL interview questions?
自动驾驶之心· 2025-09-12 16:03
Author | Abel chen  Editor | 自动驾驶之心  Original link: https://zhuanlan.zhihu.com/p/1948681769332240910

1. Is GRPO on-policy or off-policy? Why?

Short answer: GRPO, as originally designed and commonly implemented, is on-policy (online / proximal-policy style); but it can be extended to off-policy, and dedicated work has studied this extension and its trade-offs.

Why it is on-policy (explanation)

Why some argue it can be made off-policy (extension)

Recent work has generalized the GRPO idea to off-policy settings (for example, using data from other policies or older batches to estimate advantages and apply corrections), and reports potential gains and trade-offs in sample efficiency and stability. In other words, although GRPO is fundamentally built on an on-policy surrogate objective, it can be turned into an off-policy variant, both mathematically and in engineering practice, through techniques such as importance sampling, within-batch normalization, and clipping (see the sketch below).

Practical advice (brief) ...
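The off-policy correction mentioned above hinges on re-weighting samples drawn from an older policy with an importance-sampling ratio and then clipping that ratio, PPO-style, while keeping GRPO's group-relative advantage. Below is a minimal sketch of what such an objective could look like; the function name, tensor shapes, and hyperparameters are illustrative assumptions, not a reference implementation.

```python
import torch

def grpo_style_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    # logp_new: (G, T) log-probs of the sampled tokens under the current policy
    # logp_old: (G, T) log-probs under the (possibly stale) behavior policy
    # rewards:  (G,)   one scalar reward per sampled response in the group

    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)          # (G,)

    # Importance-sampling ratio corrects for data drawn from an older policy.
    ratio = torch.exp(logp_new - logp_old)                             # (G, T)

    # PPO-style clipping keeps the correction from pushing the update too far.
    unclipped = ratio * adv[:, None]
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv[:, None]
    per_token = torch.minimum(unclipped, clipped)

    # Maximize the surrogate, i.e. minimize its negative mean.
    return -per_token.mean()
```

Note that when the batch is freshly sampled from the current policy, logp_new equals logp_old, the ratio is identically 1, and this reduces to the plain on-policy case; the correction only does work once stale or external data enter the batch.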
Does DeepSeek's GRPO cause model collapse? A look at Qwen3's new paradigm, GSPO
机器之心· 2025-08-07 09:42
Core Viewpoint - The article discusses the evolution of reinforcement learning techniques in the post-training phase of large language models (LLMs), highlighting the introduction of Group Sequence Policy Optimization (GSPO) as a solution to the instability issues associated with Group Relative Policy Optimization (GRPO) [2][10][31].

Group 1: Training Phases and Techniques
- The training of large language models typically consists of two phases: pre-training and post-training, where the latter focuses on improving the model's understanding and execution of human instructions [1].
- The post-training phase employs reinforcement learning, with initial methods like Reinforcement Learning from Human Feedback (RLHF) being time-consuming and costly due to reliance on human annotators [2][3].

Group 2: Innovations and Comparisons
- DeepSeek introduced an automated approach to RLHF, significantly reducing costs and improving efficiency by letting the model learn from reward signals rather than manual evaluations [2].
- The DeepSeek team proposed the Group Relative Policy Optimization (GRPO) algorithm, which they believe is more effective than the Proximal Policy Optimization (PPO) used by OpenAI in ChatGPT [3][5].

Group 3: Issues with GRPO
- The Qwen team identified serious stability issues with GRPO, particularly due to its reliance on token-level importance sampling, which can lead to high variance and training instability [10][11][12].
- The instability arises from misapplying importance-sampling weights at the token level, where high variance accumulates over long sequences and exacerbates the training challenges [15][16][17].

Group 4: Introduction of GSPO
- To address the issues with GRPO, the Qwen team proposed Group Sequence Policy Optimization (GSPO), which uses sequence-level importance sampling to improve training stability (see the sketch after this summary) [10][22][31].
- GSPO's design mitigates the accumulation of variance seen in token-level sampling, leading to improved training efficiency and stability [23][24].

Group 5: Experimental Evidence and Advantages
- Experimental results demonstrated that GSPO outperformed GRPO in various tasks, showcasing better scalability and efficiency in training [20][30].
- The Qwen team highlighted that GSPO simplifies the training of Mixture-of-Experts (MoE) models by eliminating the need for auxiliary strategies like Routing Replay, which were necessary for GRPO to achieve stable convergence [25][27][30].
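To make the token-level vs. sequence-level distinction concrete, here is a minimal sketch contrasting the two kinds of importance ratios. It is an illustration under assumed tensor shapes and names, not the Qwen team's GSPO implementation.

```python
import torch

def token_level_ratios(logp_new, logp_old):
    # GRPO-style: one importance ratio per token. Each token contributes its
    # own, possibly noisy, weight, so variance can accumulate over long sequences.
    return torch.exp(logp_new - logp_old)                              # (G, T)

def sequence_level_ratio(logp_new, logp_old, lengths):
    # GSPO-style: one importance ratio per response. Token log-prob differences
    # are summed over the sequence and length-normalized, so a single, smoother
    # weight scales the whole response.
    seq_logdiff = (logp_new - logp_old).sum(dim=-1)                     # (G,)
    return torch.exp(seq_logdiff / lengths)                             # (G,)
```

In the sequence-level variant, one ratio multiplies the group-relative advantage of the entire response, which is the mechanism the article credits for keeping per-token noise from compounding over long generations.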