GRPO

Not Black Magic! HKUST, Tsinghua, and Others Team Up to Crack Open the Reasoning Black Box: RL Makes AI Think Like a Human
具身智能之心· 2025-10-10 00:02
Core Insights
- The article discusses recent research by teams from the Hong Kong University of Science and Technology, the University of Waterloo, and Tsinghua University, which reveals that large language models (LLMs) learn reasoning in a human-like manner by separating high-level strategy planning from low-level execution [3][10][12].

Group 1: Reinforcement Learning and LLMs
- Reinforcement Learning (RL) enhances the reasoning capabilities of LLMs, although the underlying mechanisms had not been clearly understood until now [2][5].
- The research highlights the importance of RL in enabling models to exhibit reflective behaviors during interactions with the RL environment [7][10].
- Two significant experimental clues are identified, the "length scaling effect" and the "aha moment," indicating that LLMs can learn to use more thinking time to solve reasoning tasks [8][9][10].

Group 2: Learning Dynamics
- The study outlines a two-phase learning dynamic in LLMs during RL training: the first phase consolidates basic execution skills, while the second shifts toward exploring high-level planning strategies [14][22].
- In the first phase, the model focuses on mastering low-level operations, marked by a decrease in the uncertainty of execution tokens [23][24].
- In the second phase, the model actively expands its library of planning strategies, which correlates with improved reasoning accuracy and longer solution chains [28][30].

Group 3: HICRA Algorithm
- The research introduces a new algorithm called HICRA (Hierarchy-Aware Credit Assignment), which emphasizes learning on planning tokens over execution tokens to enhance reasoning capabilities; an illustrative sketch follows this summary [18][42].
- HICRA consistently outperforms mainstream methods like GRPO, particularly when the model has a solid foundation in execution skills [20][45].
- Experimental results show that HICRA yields significant improvements over GRPO on various reasoning benchmarks, indicating its effectiveness in optimizing planning tokens [46][47].

Group 4: Insights on Token Dynamics
- The study shows that observed phenomena such as "aha moments" and "length scaling" are not random but indicative of a structured learning process [33][35].
- Overall token-level entropy decreases as the model becomes more predictable at executing low-level tasks, while the semantic entropy of planning tokens increases, reflecting the model's exploration of new strategies [39][40].
- The findings suggest that the key to enhancing reasoning capabilities lies in improving planning ability rather than merely optimizing execution details [20][41].
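The summary above describes HICRA only at a high level: shift credit from execution tokens to planning tokens on top of a group-relative baseline. A minimal Python sketch of that re-weighting idea is given below, assuming planning tokens have already been identified (for example by a separate classifier) and that each token already carries an advantage value; the function name and the amplification factor `alpha` are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def hierarchy_aware_credit(token_advantages, is_planning_token, alpha=2.0):
    """Illustrative hierarchy-aware re-weighting: amplify the credit assigned
    to planning tokens so the policy gradient concentrates on high-level
    strategy choices rather than low-level execution details.
    `alpha` is a hypothetical amplification factor, not a value from the paper."""
    adv = np.array(token_advantages, dtype=np.float64)
    mask = np.asarray(is_planning_token, dtype=bool)
    adv[mask] *= alpha
    return adv

# Example: a 6-token response whose 2nd and 3rd tokens are planning tokens.
advs = hierarchy_aware_credit([0.5] * 6, [False, True, True, False, False, False])
print(advs)  # planning tokens now carry 2x the credit of execution tokens
```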
A Primer: One Article Deconstructing LLM Post-Training and the Origins and Evolution of GRPO and Its Successors
36Kr· 2025-09-01 04:38
Group 1
- The core of the article is the evolution of post-training methods for large language models, focusing on the GRPO algorithm as a significant advance in reinforcement learning paradigms [2][46].
- GRPO has emerged as a general-purpose reinforcement learning algorithm applicable to a wide range of post-training tasks, with notable improvements over previous methods like PPO [2][48].
- The article discusses the importance of post-training in enhancing the adaptability and flexibility of models, addressing the limitations of pre-training alone [5][46].

Group 2
- The article highlights the transition from PPO to GRPO, emphasizing reduced computational cost and memory requirements that make GRPO a more efficient alternative [18][14].
- GRPO uses the rewards of a group of sampled responses to establish the baseline for advantage estimation, eliminating the need for a separate value function; a minimal sketch of this idea follows the summary [16][14].
- Despite its advantages, GRPO still faces stability issues, prompting further research and improved algorithms such as DAPO and GSPO [19][48].

Group 3
- DAPO, developed by ByteDance and Tsinghua AIR, builds on GRPO with enhancements such as Clip-Higher and dynamic sampling to improve training efficiency [20][21].
- GSPO represents a significant advance by shifting from token-level to sequence-level importance sampling, which improves training stability [28][30].
- GFPO addresses GRPO's limitations by allowing simultaneous optimization of multiple response attributes, improving overall model performance [33][34].
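Since the entry above repeatedly refers to GRPO's critic-free, group-based advantage estimation, a minimal sketch of that mechanism may help. It assumes a simple binary correctness reward and omits the KL-regularization term; the constants and function names are illustrative rather than the exact DeepSeek objective.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Critic-free baseline: sample G responses for one prompt, score them,
    and take each response's advantage as the z-score of its reward within
    the group. No separate value network is needed."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped term for one response, with its single group-relative
    advantage broadcast over the response's tokens. The KL penalty and other
    loss terms are omitted from this sketch."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage).mean()

# Example: 4 sampled answers to one prompt, rewarded 1 if correct, 0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [ 1. -1. -1.  1.]
```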
A Primer: One Article Deconstructing LLM Post-Training and the Origins and Evolution of GRPO and Its Successors
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article discusses the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm in the context of large language models and reinforcement learning, highlighting its advantages and limitations compared with earlier methods such as Proximal Policy Optimization (PPO) [4][38].

Summary by Sections

Development of Large Language Models
- The rapid advancement of large language models has led to a variety of post-training methods, with GRPO being a notable innovation that enhances reinforcement learning paradigms [3][5].

Post-Training and Reinforcement Learning
- Post-training is crucial for refining a model's capabilities in specific domains, enhancing its adaptability and flexibility to meet diverse application needs [12][11].
- Reinforcement learning, particularly from human feedback (RLHF), plays a vital role in the post-training phase, aiming to optimize model outputs based on user preferences [14][19].

GRPO and Its Advantages
- GRPO eliminates the need for a separate critic model, significantly reducing memory and computational costs compared with PPO, which requires two networks [30][35].
- GRPO scores a group of sampled responses per prompt and uses the group's statistics as the baseline for evaluating improvements, simplifying the training process [34][35].

Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory requirements and training speed, making it a more efficient choice for training large language models [37].
- Despite its advantages, GRPO still faces stability issues similar to PPO's, particularly in smaller-scale reinforcement learning tasks [39].

Recent Innovations: DAPO, GSPO, and GFPO
- DAPO introduces enhancements to GRPO, such as Clip-Higher and dynamic sampling, to address practical challenges encountered during training; a sketch of both tricks follows this summary [41][42].
- GSPO advances the methodology by shifting from token-level to sequence-level importance sampling, significantly improving training stability [48][49].
- GFPO allows simultaneous optimization of multiple response attributes, addressing GRPO's limitations with scalar feedback and multi-round reasoning tasks [61][63].

Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, traces a clear trajectory in optimizing large language models, with GRPO serving as a pivotal point for further advances in the field [81][82].
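To make the DAPO enhancements named above concrete, here is a rough Python sketch of asymmetric clipping (Clip-Higher) and the filtering rule behind dynamic sampling. The epsilon values and function names are assumptions chosen for illustration, not figures taken from the DAPO paper.

```python
import numpy as np

def clip_higher_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clip-Higher: widen only the upper clipping bound so low-probability
    tokens can still be up-weighted, preserving exploration. The epsilon
    values here are illustrative."""
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

def keep_group_for_update(group_rewards):
    """Dynamic sampling: skip prompt groups whose rewards are all identical
    (all correct or all wrong), since their group-relative advantages are
    zero and contribute no gradient signal."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return bool(r.std() > 0)
```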
Verbose Responses Cut by 80%: DeepSeek's GRPO Gets a Disruptive Upgrade with the Debut of Microsoft's GFPO
机器之心· 2025-08-14 04:57
Core Viewpoint
- The article introduces a new reinforcement learning algorithm, Group Filtered Policy Optimization (GFPO), which improves the efficiency of reasoning models by significantly reducing unnecessary token length during inference while maintaining accuracy [2][3][9].

Summary by Sections

Introduction to GFPO
- GFPO balances computational cost between the training and testing phases, achieving up to an 80% reduction in token length during inference [3][5].

Background on GRPO
- Group Relative Policy Optimization (GRPO) is a simplified variant of the Proximal Policy Optimization (PPO) algorithm that does not require a value model for baseline advantage estimation [7][8].
- GRPO is limited by its reliance on a single scalar reward signal, which makes it hard to optimize multiple response attributes simultaneously and leads to inflated response lengths [8][9].

Mechanism of GFPO
- GFPO enables targeted policy optimization for desired response attributes by sampling a larger candidate response group and filtering it on specific characteristics; a sketch of this filtering step follows the summary [11].
- The algorithm normalizes the advantages of the selected responses using their own mean and standard deviation, so that only the retained responses contribute to policy updates [13][14].

Adaptive Difficulty in GFPO
- An adaptive variant of GFPO allocates more training signal to harder problems, dynamically adjusting the number of retained responses based on problem difficulty [21][22].

Experimental Findings
- Sampling more candidate responses is important for reducing response lengths effectively [28].
- Token-efficiency optimization yields large length reductions while maintaining accuracy, with reductions of 70.9% to 84.6% across different benchmarks [31].
- GFPO effectively mitigates out-of-distribution length inflation while slightly improving accuracy [32].
- The adaptive-difficulty variant outperforms the Shortest-k baseline in length reduction across multiple benchmarks [31][40].

Conclusion
- GFPO substantially reduces unnecessary response length during reasoning and validation, achieving a 94.4% reduction in excess length for answers and a 66.7% reduction for validation steps on specific benchmarks [44].
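A minimal sketch of GFPO's filter-then-normalize step is shown below. It assumes shortest response length as the filtering attribute and treats the retained subset as the normalization group; the metric choice, variable names, and group size are illustrative, since GFPO as described also supports other attributes such as token efficiency.

```python
import numpy as np

def gfpo_advantages(rewards, lengths, k, eps=1e-8):
    """Sample a larger candidate group, keep only the k responses that score
    best on a desired attribute (here: shortest length; token efficiency,
    i.e. reward per token, is another option the article mentions), and
    normalize advantages within the retained subset. Filtered-out responses
    get zero advantage and therefore no gradient."""
    rewards = np.asarray(rewards, dtype=np.float64)
    lengths = np.asarray(lengths, dtype=np.float64)
    keep = np.argsort(lengths)[:k]        # indices of the k shortest responses
    adv = np.zeros_like(rewards)
    kept = rewards[keep]
    adv[keep] = (kept - kept.mean()) / (kept.std() + eps)
    return adv

# Example: 6 candidates; only the 3 shortest contribute to the policy update.
print(gfpo_advantages([1, 1, 0, 1, 0, 1], [120, 400, 90, 80, 300, 500], k=3))
```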
Does DeepSeek's GRPO Cause Model Collapse? A Look at Qwen3's New Paradigm, GSPO
机器之心· 2025-08-07 09:42
Core Viewpoint
- The article discusses the evolution of reinforcement learning techniques in the post-training phase of large language models (LLMs), highlighting the introduction of Group Sequence Policy Optimization (GSPO) as a solution to the instability issues associated with Group Relative Policy Optimization (GRPO) [2][10][31].

Group 1: Training Phases and Techniques
- Training a large language model typically consists of two phases, pre-training and post-training, where the latter focuses on improving the model's understanding and execution of human instructions [1].
- The post-training phase employs reinforcement learning; early methods such as Reinforcement Learning from Human Feedback (RLHF) were time-consuming and costly due to their reliance on human annotators [2][3].

Group 2: Innovations and Comparisons
- DeepSeek introduced an automated approach to RLHF, significantly reducing costs and improving efficiency by letting the model learn from reward signals rather than manual evaluations [2].
- The DeepSeek team proposed the Group Relative Policy Optimization (GRPO) algorithm, which they argue is more effective than the Proximal Policy Optimization (PPO) used by OpenAI in ChatGPT [3][5].

Group 3: Issues with GRPO
- The Qwen team identified serious stability issues with GRPO, particularly due to its reliance on token-level importance sampling, which can introduce high variance and destabilize training [10][11][12].
- The instability arises from the mismatched application of importance-sampling weights at the token level, where variance accumulates over long sequences and exacerbates the training challenges [15][16][17].

Group 4: Introduction of GSPO
- To address GRPO's issues, the Qwen team proposed Group Sequence Policy Optimization (GSPO), which uses sequence-level importance sampling to improve training stability; a sketch of the sequence-level ratio follows this summary [10][22][31].
- GSPO's design mitigates the variance accumulation seen in token-level sampling, improving training efficiency and stability [23][24].

Group 5: Experimental Evidence and Advantages
- Experimental results showed that GSPO outperformed GRPO on various tasks, with better scalability and training efficiency [20][30].
- The Qwen team also highlighted that GSPO simplifies training of Mixture-of-Experts (MoE) models by eliminating the need for auxiliary strategies like Routing Replay, which GRPO required for stable convergence [25][27][30].
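The sequence-level importance ratio at the heart of GSPO can be sketched compactly, as below. This assumes per-token log-probabilities from the new and old policies are available; the length-normalized ratio follows the description above, while the clipping constant is a placeholder rather than the paper's value.

```python
import numpy as np

def gspo_sequence_ratio(logp_new_tokens, logp_old_tokens):
    """Sequence-level importance ratio: the length-normalized likelihood ratio
    of the whole response under the new vs. old policy. Using one ratio per
    sequence avoids the variance that accumulates when per-token ratios are
    multiplied over long generations."""
    logp_new = np.asarray(logp_new_tokens, dtype=np.float64)
    logp_old = np.asarray(logp_old_tokens, dtype=np.float64)
    return float(np.exp((logp_new.sum() - logp_old.sum()) / len(logp_new)))

def gspo_clipped_term(seq_ratio, advantage, clip_eps=0.2):
    """Clipping is applied once per response, at the sequence level.
    The clipping constant here is a placeholder, not the paper's value."""
    clipped = float(np.clip(seq_ratio, 1 - clip_eps, 1 + clip_eps))
    return min(seq_ratio * advantage, clipped * advantage)
```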
When a Prompt Optimizer Learns to Evolve, It Can Even Beat Reinforcement Learning
机器之心· 2025-07-31 08:58
Core Viewpoint
- The article introduces GEPA (Genetic-Pareto), a new prompt-optimization technique that outperforms the GRPO reinforcement learning algorithm by 20% while requiring only about 1/35 as many rollouts [2][39].

Group 1: GEPA Overview
- GEPA employs a technique called reflective prompt evolution, which enhances the performance of composite AI systems [2][6].
- Its core principles are genetic prompt evolution, the use of natural-language feedback, and Pareto-based candidate selection [7][8].

Group 2: GEPA Algorithm
- GEPA initializes a candidate pool from the composite AI system's parameters and iteratively proposes new candidates until the evaluation budget is exhausted [12][15].
- The optimization process mutates or crosses over existing candidates, letting GEPA accumulate learning signals and improve candidate performance across iterations [16][17].

Group 3: Reflective Feedback Mechanism
- Natural-language trajectories generated while the composite AI system runs expose its reasoning steps, providing diagnostic value for decision-making [19][20].
- GEPA uses these trajectories for implicit credit assignment, enabling targeted updates to modules based on their performance [21][22].

Group 4: Candidate Selection Strategy
- GEPA uses a Pareto-based candidate selection strategy to avoid local optima and balance exploration and exploitation; a sketch of the dominance filter follows this summary [27][30].
- The strategy identifies candidates that achieve the best scores across training tasks and filters out strictly dominated candidates [31][32].

Group 5: Performance Evaluation
- Experimental results show that GEPA consistently outperforms MIPROv2 and GRPO across various benchmarks, with improvements of up to 14.29% [42][39].
- GEPA is highly sample-efficient, outperforming GRPO while requiring far fewer rollouts [39][41].

Group 6: Observations and Insights
- The strategy for selecting the next candidate significantly affects optimization trajectories and final performance, with Pareto-based sampling showing clear advantages [43].
- Prompts optimized by GEPA are shorter and more efficient than few-shot demonstration prompts, improving computational efficiency [45].
- A system-aware crossover strategy, GEPA+Merge, yields additional performance gains by combining complementary strategies from different optimization lineages [47].
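A small sketch of the Pareto-based candidate filter described above: given a matrix of per-task scores for each candidate prompt, keep only candidates that are not strictly dominated. The data layout and function name are assumptions for illustration; GEPA's full selection step also involves sampling from this front, which is omitted here.

```python
import numpy as np

def pareto_front(score_matrix):
    """`score_matrix[i][t]` is the score of candidate prompt i on training
    task t. A candidate is kept if no other candidate is at least as good on
    every task and strictly better on at least one, i.e. it is not strictly
    dominated."""
    scores = np.asarray(score_matrix, dtype=np.float64)
    front = []
    for i in range(scores.shape[0]):
        dominated = any(
            np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
            for j in range(scores.shape[0]) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Example: candidate 2 is dominated by candidate 0 and is filtered out.
print(pareto_front([[0.9, 0.2], [0.5, 0.8], [0.4, 0.1]]))  # -> [0, 1]
```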
A Roundup of the Latest Advances in RL for VLA
自动驾驶之心· 2025-07-03 12:41
Core Viewpoint
- The article reviews recent advances in Vision-Language-Action (VLA) models, focusing on how Reinforcement Learning (RL) techniques are being integrated to improve their performance and stability across tasks [1].

Group 1: Early Exploration with iRe-VLA
- The core algorithm of iRe-VLA is PPO, with a two-stage training paradigm introduced to address the instability of online reinforcement learning [2].
- The implementation uses BLIP-2 3B as the VLM backbone, replacing the final fully connected layer with an action head consisting of a token learner and an MLP [2].
- Experiments run in simulation environments such as MetaWorld and Franka Kitchen, with tasks divided into three categories for evaluation [2].

Group 2: Preference Alignment with GRAPE
- GRAPE introduces preference alignment into VLA training, designed specifically around VLA characteristics [6].
- The reward for each trajectory is composed of three parts: a success reward, a self-reward, and an external reward based on a custom cost function; a sketch of this composition follows the summary [8].
- The external reward is computed by decomposing trajectories into stages and evaluating them with a VLM task decomposer [9].

Group 3: LOOP and RIPT-VLA
- LOOP combines RLOO and PPO to handle sparse rewards and long sequences in multi-task scenarios [11].
- RIPT-VLA applies the LOOP algorithm for online RL and provides open-source code for the implementation [13].
- The approach includes several tricks to improve training efficiency, such as dynamic rejection mechanisms and multi-task sampling [15].

Group 4: System and Algorithm Innovations in RL4VLA
- RL4VLA models the action-generation process as a multi-modal dialogue and trains with PPO, using dense pseudo-rewards to guide training [18].
- Training involves a Robotic Process Reward Model that predicts the likelihood of action sequences, enriching the reward signal [20].
- Adaptive curriculum-selection strategies are emphasized as a way to improve sample efficiency and generalization [21][23].

Group 5: Engineering Challenges and Future Directions
- New RL algorithms suited to VLA-RL are needed, particularly to address sparse rewards and improve sample efficiency [30].
- Improving sampling efficiency and managing memory cost are significant engineering challenges in VLA scenarios [30].
- Effective reward design and applying RL to non-autoregressive VLA architectures are identified as critical directions for future research [30].
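As a rough illustration of GRAPE's trajectory reward described above, the sketch below combines a success term, a self-reward term, and an external term derived from stage-wise costs. The weighting scheme, function signature, and the use of a trajectory log-likelihood as the self-reward are assumptions; the article does not specify how the three parts are combined.

```python
def grape_trajectory_reward(success, self_reward, stage_costs,
                            w_success=1.0, w_self=1.0, w_ext=1.0):
    """Hypothetical composition of a GRAPE-style trajectory reward: a task
    success term, a self-reward term (e.g. the policy's own log-likelihood of
    the trajectory), and an external term derived from stage-wise costs
    produced by a VLM task decomposer. The weights and the linear combination
    are assumptions for illustration."""
    external = -sum(stage_costs)  # lower accumulated stage cost -> higher reward
    return w_success * success + w_self * self_reward + w_ext * external

# Example: a successful trajectory with three decomposed stages.
print(grape_trajectory_reward(success=1.0, self_reward=-2.3,
                              stage_costs=[0.4, 0.1, 0.2]))
```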