GRPO

Verbose responses cut by 80%: DeepSeek's GRPO gets a disruptive upgrade as Microsoft's GFPO debuts
机器之心· 2025-08-14 04:57
Core Viewpoint
- The article introduces a new reinforcement learning algorithm, Group Filtered Policy Optimization (GFPO), which improves the efficiency of reasoning models by substantially reducing unnecessary token length during inference while maintaining accuracy [2][3][9].

Summary by Sections

Introduction to GFPO
- GFPO balances training-time and test-time compute, trading more sampling during training for up to an 80% reduction in token length during inference [3][5].

Background on GRPO
- Group Relative Policy Optimization (GRPO) is a simplified variant of Proximal Policy Optimization (PPO) that does not require a value model for baseline advantage estimation [7][8].
- GRPO is limited by its reliance on a single scalar reward signal, which makes it hard to optimize multiple response attributes simultaneously and tends to inflate response lengths [8][9].

Mechanism of GFPO
- GFPO enables targeted optimization of desired response attributes by sampling a larger candidate response group and filtering it according to specific characteristics [11].
- The algorithm normalizes the advantages of the selected responses using their mean and standard deviation, so that only the most relevant responses contribute to the policy update (a minimal sketch follows this summary) [13][14].

Adaptive Difficulty in GFPO
- An adaptive variant of GFPO allocates more training signal to harder problems, dynamically adjusting the number of retained responses according to problem difficulty [21][22].

Experimental Findings
- Sampling more responses is important for reducing response lengths effectively [28].
- Optimizing for token efficiency yields large length reductions while maintaining accuracy, with reductions of 70.9% to 84.6% across different benchmarks [31].
- GFPO effectively mitigates out-of-distribution length inflation while slightly improving accuracy [32].
- The adaptive-difficulty variant outperforms the Shortest-k baseline in length reduction across multiple benchmarks [31][40].

Conclusion
- GFPO substantially reduces unnecessary response length during reasoning and validation, achieving a 94.4% reduction in excess length for answers and a 66.7% reduction for validation steps on specific benchmarks [44].
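The filter-then-normalize step described under "Mechanism of GFPO" can be sketched in a few lines. This is a minimal illustration only: token efficiency (reward per token) is one of the attributes the article discusses, but the function name, the zero-advantage treatment of filtered-out responses, and the exact normalization constant are assumptions for illustration, not Microsoft's reference implementation.

```python
import numpy as np

def gfpo_advantages(rewards, lengths, k):
    """Sketch of a GFPO-style filter-then-normalize step (hypothetical helper).

    rewards: per-response scalar rewards for one prompt's sampled group
    lengths: token lengths of the corresponding responses
    k: number of responses to retain after filtering
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Filter: rank the sampled group by a desired attribute, here token
    # efficiency (reward per token), and keep only the top-k responses.
    efficiency = rewards / np.maximum(lengths, 1.0)
    retained = np.argsort(efficiency)[::-1][:k]

    # Normalize advantages with the mean and std of the *retained* subset,
    # so only the filtered responses contribute to the policy update.
    subset = rewards[retained]
    advantages = np.zeros_like(rewards)
    advantages[retained] = (subset - subset.mean()) / (subset.std() + 1e-8)
    return advantages  # zero advantage => excluded from the update

# Example: sample 8 responses per prompt, keep the 4 most token-efficient.
adv = gfpo_advantages(rewards=[1, 1, 0, 1, 0, 1, 1, 0],
                      lengths=[900, 350, 500, 1200, 800, 400, 700, 300],
                      k=4)
```

With a larger sampled group and a smaller retained subset, the policy update is driven only by responses that are both correct and short, which is how the filtering is meant to suppress length inflation without touching the scalar reward itself.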
Does DeepSeek's GRPO cause model collapse? A look at Qwen3's new paradigm, GSPO
机器之心· 2025-08-07 09:42
机器之心 report, by the 机器之心 editorial team.

As is well known, training a large language model typically proceeds in two stages.

The first stage is pre-training: developers train the model on large-scale text datasets so that it learns to predict the next word in a sentence.

The second stage is post-training, which aims to teach the model to better understand and follow human instructions.

LLM post-training can be viewed as a special form of reinforcement learning, and the RL algorithms used to fine-tune large language models (LLMs) have been evolving along a clear trajectory.

Initially, OpenAI pioneered a technique called Reinforcement Learning from Human Feedback (RLHF) to improve ChatGPT. At its core, RLHF has human annotators score multiple responses generated by the model and select the best one as a training reference. Although effective, this process is time-consuming, expensive, and labor-intensive, typically requiring a small but specialized data-annotation team.

DeepSeek's key innovation was to automate this step with RL. Instead of relying on humans to evaluate each response, the algorithm lets the model learn correct behavior on its own through "reward signals" obtained during exploration, significantly reducing cost and improving efficiency, and ultimately delivering high performance at low cost.

OpenAI used Proximal Policy Optimization (PPO) in the training of ChatGPT. ...
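The excerpt breaks off here, but since the article's subject is GRPO versus Qwen3's GSPO, it may help to recall the one step that distinguishes GRPO from the PPO setup described above: each sampled response's advantage is computed relative to its own group rather than estimated by a learned value (critic) model. A minimal sketch, assuming one scalar reward per response; variable names are illustrative, not DeepSeek's code.

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantage as popularized by GRPO: each sampled
    response is scored against its own group's mean and std, removing
    the need for a separate value (critic) model."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, a group of 4 sampled responses scored by a reward model or verifier.
print(grpo_advantages([0.2, 0.9, 0.4, 0.9]))
```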
When a prompt optimizer learns to evolve, it can even beat reinforcement learning
机器之心· 2025-07-31 08:58
Core Viewpoint
- The article introduces GEPA (Genetic-Pareto), a new optimization technique that outperforms the GRPO reinforcement learning algorithm by up to 20% while requiring only about 1/35 as many rollouts [2][39].

Group 1: GEPA Overview
- GEPA employs a technique called reflective prompt evolution, which improves the performance of compound AI systems [2][6].
- Its core principles are genetic prompt evolution, the use of natural-language feedback, and Pareto-based candidate selection [7][8].

Group 2: GEPA Algorithm
- GEPA initializes a candidate pool with the parameters of the compound AI system and iteratively proposes new candidates until the evaluation budget is exhausted [12][15].
- New candidates are produced by mutating or crossing over existing ones, allowing GEPA to accumulate learning signals and improve candidate performance across iterations [16][17].

Group 3: Reflective Feedback Mechanism
- Natural-language trajectories generated while the compound AI system executes expose its intermediate reasoning steps, providing diagnostic value for decision-making [19][20].
- GEPA uses these trajectories for implicit credit assignment, enabling targeted updates to individual modules based on their performance [21][22].

Group 4: Candidate Selection Strategy
- GEPA adopts a Pareto-based candidate selection strategy to avoid local optima and balance exploration with exploitation [27][30].
- The strategy keeps candidates that achieve the best scores on at least one training task and filters out strictly dominated candidates (a sketch follows this summary) [31][32].

Group 5: Performance Evaluation
- Experimental results show that GEPA consistently outperforms MIPROv2 and GRPO across various benchmarks, with improvements of up to 14.29% [42][39].
- GEPA is highly sample-efficient, outperforming GRPO while requiring far fewer rollouts [39][41].

Group 6: Observations and Insights
- The next-candidate selection strategy significantly affects optimization trajectories and final performance, with Pareto-based sampling showing clear advantages [43].
- Prompts optimized by GEPA are shorter and more efficient than few-shot demonstration prompts, improving computational efficiency [45].
- A system-aware crossover strategy, GEPA+Merge, yields additional gains by combining complementary strategies from different optimization lineages [47].
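The Pareto-based selection described in Group 4 can be illustrated with a small sketch: keep every candidate that is not strictly dominated (no other candidate is at least as good on every training task and strictly better on one), then sample the next candidate to mutate from that frontier. This is a simplified reading of the idea; the data structures and function name are assumptions for illustration, not GEPA's actual code.

```python
import numpy as np

def pareto_frontier(scores):
    """scores: (num_candidates, num_tasks) matrix of per-task scores.
    Returns indices of candidates that are not strictly dominated,
    i.e. no other candidate is >= on every task and > on at least one."""
    scores = np.asarray(scores, dtype=float)
    keep = []
    for i, s_i in enumerate(scores):
        dominated = any(
            np.all(s_j >= s_i) and np.any(s_j > s_i)
            for j, s_j in enumerate(scores) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# Toy example: 4 candidate prompts scored on 3 training tasks.
scores = [[0.9, 0.2, 0.5],
          [0.4, 0.8, 0.6],
          [0.3, 0.1, 0.4],   # strictly dominated by candidate 1 -> filtered out
          [0.9, 0.2, 0.5]]   # ties with candidate 0 -> kept on the frontier
frontier = pareto_frontier(scores)
# A GEPA-style optimizer would sample the next candidate to mutate from
# this frontier, which is how it balances exploration and exploitation.
```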
A review of the latest progress in RL for VLA
自动驾驶之心· 2025-07-03 12:41
Core Viewpoint
- The article reviews recent advances in Vision-Language-Action (VLA) models, focusing on how reinforcement learning (RL) techniques are being integrated to improve their performance and stability across tasks [1].

Group 1: Early Exploration of iRe-VLA
- The core algorithm of iRe-VLA is PPO, with a two-stage training paradigm introduced to address the instability of online reinforcement learning [2].
- The implementation uses BLIP-2 3B as the VLM backbone, replacing the final fully connected layer with an action head consisting of a token learner and an MLP [2].
- Experiments are run in simulation environments such as Meta-World and Franka Kitchen, with tasks divided into three categories for evaluation [2].

Group 2: Preference Alignment with GRAPE
- GRAPE brings preference alignment into VLA training, with a design tailored to the characteristics of VLA models [6].
- The reward for each trajectory is composed of three parts: a success reward, a self-reward, and an external reward based on a custom cost function (a hedged sketch of this composition follows this summary) [8].
- The external reward is computed by decomposing trajectories into stages and evaluating them with a VLM task decomposer [9].

Group 3: LOOP and RIPT-VLA
- LOOP combines RLOO and PPO to address the challenges of sparse rewards and long horizons in multi-task settings [11].
- RIPT-VLA applies the LOOP algorithm for online RL and provides open-source code for the implementation [13].
- The approach includes several tricks to improve training efficiency, such as dynamic rejection mechanisms and multi-task sampling [15].

Group 4: System and Algorithm Innovations in RL4VLA
- RL4VLA models the action-generation process as a multi-modal dialogue and trains with PPO, using dense pseudo-rewards to guide training [18].
- Training uses a Robotic Process Reward Model that predicts the likelihood of action sequences, strengthening the reward signal [20].
- Adaptive curriculum selection strategies are emphasized as a way to improve sample efficiency and generalization [21][23].

Group 5: Engineering Challenges and Future Directions
- New RL algorithms suited to VLA-RL are needed, particularly to handle sparse rewards and improve sample efficiency [30].
- Improving sampling efficiency and managing memory costs remain engineering challenges in VLA scenarios [30].
- Effective reward design and applying RL to non-autoregressive VLA architectures are identified as key directions for future research [30].
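As a concrete but simplified reading of GRAPE's three-part trajectory reward in Group 2, the terms could be assembled as below. The weights, names, and the way per-stage costs are turned into an external reward are assumptions made for illustration; the article does not specify how the three parts are combined.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryReward:
    """Illustrative three-part trajectory reward in the spirit of GRAPE.
    The weighting and combination rule are assumptions, not the paper's formula."""
    success: float      # task-success reward (e.g. 1.0 if the rollout succeeds)
    self_reward: float  # policy's own likelihood-based score for the trajectory
    external: float     # cost-function score over VLM-decomposed stages

    def total(self, w_success=1.0, w_self=0.5, w_external=0.5) -> float:
        return (w_success * self.success
                + w_self * self.self_reward
                + w_external * self.external)

def external_stage_reward(stage_costs):
    """Turn per-stage costs (from a VLM task decomposer) into a reward:
    lower accumulated cost over the decomposed stages => higher reward."""
    return -sum(stage_costs)

r = TrajectoryReward(success=1.0,
                     self_reward=0.7,
                     external=external_stage_reward([0.2, 0.1, 0.3]))
print(r.total())
```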