Verbose responses cut by 80%: DeepSeek's GRPO gets a disruptive improvement as Microsoft's GFPO arrives
机器之心· 2025-08-14 04:57
Core Viewpoint
- The article introduces Group Filtered Policy Optimization (GFPO), a new reinforcement learning algorithm that makes reasoning models more efficient by sharply reducing unnecessary tokens at inference time while maintaining accuracy [2][3][9].

Summary by Sections

Introduction to GFPO
- GFPO trades extra training-time compute for lower test-time compute, achieving up to an 80% reduction in token length during inference [3][5].

Background on GRPO
- Group Relative Policy Optimization (GRPO) is a simplified variant of Proximal Policy Optimization (PPO) that needs no value model to estimate the baseline for advantages [7][8].
- GRPO relies on a single scalar reward signal, which makes it hard to optimize several response attributes at once and tends to inflate response length [8][9].

Mechanism of GFPO
- GFPO steers the policy toward desired response attributes by sampling a larger group of candidate responses and filtering them on specific characteristics [11].
- Advantages are normalized using the mean and standard deviation of the retained responses, so that only responses that pass the filter contribute to the policy update [13][14].

Adaptive Difficulty in GFPO
- An adaptive variant allocates more training signal to harder problems by dynamically adjusting the number of retained responses according to problem difficulty [21][22].

Experimental Findings
- Sampling more responses is important for reducing response length effectively [28].
- Optimizing for token efficiency yields large length reductions while maintaining accuracy, ranging from 70.9% to 84.6% across benchmarks [31].
- GFPO mitigates out-of-distribution length inflation while slightly improving accuracy [32].
- The adaptive difficulty variant outperforms the Shortest-k approach in length reduction across multiple benchmarks [31][40].

Conclusion
- GFPO substantially reduces unnecessary length in both the reasoning and validation phases, cutting excess length by 94.4% for answers and 66.7% for validation steps on specific benchmarks [44].
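
The filtering step described under "Mechanism of GFPO" can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `eps` stabilizer, the choice to give filtered-out responses zero advantage, and the definition of token efficiency as reward per token are all assumptions made for illustration. A plain group-relative baseline is included for comparison with GRPO.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style baseline: standardize rewards within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def gfpo_advantages(rewards, scores, k, eps=1e-8):
    """Hypothetical GFPO-style step for one prompt.

    Keeps the k responses with the lowest `scores` (for example, token
    counts), standardizes rewards within that retained subset only, and
    assigns zero advantage to every response that was filtered out
    (an assumption of this sketch).
    """
    r = np.asarray(rewards, dtype=float)
    s = np.asarray(scores, dtype=float)
    keep = np.argsort(s, kind="stable")[:k]  # indices of the k best scores
    mask = np.zeros(r.shape, dtype=bool)
    mask[keep] = True
    adv = np.zeros_like(r)
    kept = r[mask]
    adv[mask] = (kept - kept.mean()) / (kept.std() + eps)
    return adv, mask

# Eight sampled responses for one prompt: binary rewards and token counts.
rewards = [1, 0, 1, 0, 1, 1, 0, 1]
lengths = [900, 300, 1200, 450, 200, 1500, 800, 600]

# Filter on length: retain the 4 shortest responses.
adv_short, mask_short = gfpo_advantages(rewards, lengths, k=4)

# Filter on token efficiency, assumed here to mean reward per token.
# Scores are negated because the function keeps the lowest scores.
eff_scores = -np.asarray(rewards, dtype=float) / np.asarray(lengths, dtype=float)
adv_eff, mask_eff = gfpo_advantages(rewards, eff_scores, k=4)
```

Because the mean and standard deviation come from the retained subset alone, a correct but long response that is filtered out contributes no gradient signal, which is how the filter can push the policy toward shorter outputs without changing the reward itself.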
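
The adaptive difficulty idea, retaining more responses for harder problems, can be sketched as follows. The summary does not say how difficulty is measured or which retention counts are used, so everything here is an assumption: difficulty is taken as one minus the mean reward of the sampled group, buckets are quartiles of previously seen difficulties, and the `ks` values and the `adaptive_k` name are hypothetical.

```python
import numpy as np

def adaptive_k(mean_reward, history, ks=(4, 6, 8, 8)):
    """Choose how many responses to retain for one prompt.

    Assumptions of this sketch: difficulty = 1 - mean reward of the
    sampled group, and the bucket is the quartile of that difficulty
    among past difficulties in `history`. `ks` lists the retention
    count per bucket, from easiest to hardest.
    """
    difficulty = 1.0 - mean_reward
    if len(history) < 4:
        return ks[-1]  # too little history: retain the most responses
    q25, q50, q75 = np.percentile(history, [25, 50, 75])
    # Count how many quartile boundaries the difficulty exceeds (0..3).
    bucket = int(difficulty > q25) + int(difficulty > q50) + int(difficulty > q75)
    return ks[bucket]

# Difficulties observed on earlier prompts (illustrative values).
history = [0.0, 0.25, 0.5, 0.75, 1.0]
k_easy = adaptive_k(mean_reward=1.0, history=history)  # every sample correct
k_hard = adaptive_k(mean_reward=0.0, history=history)  # no sample correct
```

Retaining fewer responses on easy prompts applies stronger length pressure where the model already succeeds, while hard prompts keep more responses and therefore more training signal.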