Core Viewpoint
- Huawei has introduced S-GRPO, an efficient reasoning method that addresses the "redundant thinking" (overthinking) bottleneck in large language models, enabling them to generate more precise and useful answers while speeding up reasoning by 60% [2][4][24].

Group 1: S-GRPO Methodology
- S-GRPO stands for Serial-Group Decaying-Reward Policy Optimization; it improves the reasoning efficiency and accuracy of large language models by curbing redundant thinking [4][24].
- The method combines a "serial grouping + decaying reward" design that teaches the model to terminate reasoning early without compromising accuracy [2][10].
- S-GRPO introduces "early-exit reasoning": the model may stop at any intermediate step of its chain of thought and generate an answer, yielding multiple early-exit branches for training [8][9].

Group 2: Training Framework
- The training framework of S-GRPO consists of three main stages: full thought rollout, early-exit thought rollout, and reward computation with parameter update [11][14][16].
- In the full thought rollout, the model generates a complete reasoning path, which serves as the basis for the early-exit paths [13].
- In the early-exit thought rollout, the complete reasoning path is truncated at random positions to create multiple early-exit paths, each followed by a prompt that encourages the model to stop and answer [14][15].

Group 3: Experimental Results
- S-GRPO was tested on five challenging reasoning benchmarks, four mathematical reasoning tasks and one scientific reasoning task, and showed significant performance improvements [21][24].
- Compared to the vanilla reasoning model, S-GRPO improved accuracy by 0.72 to 6.08 percentage points while reducing generation length by 35.4% to 61.1% [24].
- S-GRPO outperformed existing state-of-the-art efficient-reasoning methods, maintaining accuracy while reducing reasoning length [25][28].
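The two core ingredients described above, serial early-exit branches obtained by truncating one full reasoning path and a reward that decays with exit position, can be sketched as follows. This is a minimal illustrative sketch, not Huawei's implementation; the function names, the exit prompt string, and the exponential decay factor are all assumptions.

```python
import random

def serial_group_rollouts(full_path_tokens, num_exits=4, exit_prompt="</think> Answer:"):
    """Build a serial group: several early-exit branches made by
    truncating one complete reasoning path at random positions and
    appending a stop-and-answer prompt, plus the full path itself.
    (Illustrative sketch; details are assumptions.)"""
    n = len(full_path_tokens)
    cut_points = sorted(random.sample(range(1, n), min(num_exits, n - 1)))
    branches = [full_path_tokens[:c] + [exit_prompt] for c in cut_points]
    branches.append(list(full_path_tokens))  # keep the complete path too
    return branches

def decaying_rewards(correct_flags, decay=0.5):
    """Assign decaying rewards over branches ordered from earliest
    exit to the full path: the i-th *correct* branch gets decay**i,
    incorrect branches get 0. Earlier correct exits thus earn more,
    nudging the model toward the shortest reasoning that still
    answers correctly. (Assumed reward shape for illustration.)"""
    rewards, i = [], 0
    for ok in correct_flags:
        if ok:
            rewards.append(decay ** i)
            i += 1
        else:
            rewards.append(0.0)
    return rewards
```

With this shape, an incorrect early exit earns nothing, the earliest correct exit earns the full reward, and later correct exits earn geometrically less, which is what lets the model learn to stop early without being rewarded for stopping too early.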
Group 4: Ablation Studies
- Ablation studies indicated that removing either the decaying-reward mechanism or the serial-group generation design negatively impacted the model's reasoning accuracy and efficiency [34][36].
- The experiments showed that S-GRPO effectively mitigated overthinking, allowing the model to converge toward concise and correct reasoning paths [38][39].
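The reward-computation-and-parameter-update stage mentioned earlier follows the general GRPO recipe, where each branch's reward is normalized against its own group so no separate value critic is needed. A minimal sketch of that group-relative advantage, assuming standard mean/std normalization (not Huawei's exact code):

```python
import statistics

def group_advantages(rewards):
    """GRPO-style advantage: normalize each branch's scalar reward by
    the mean and population std of its serial group. Branches better
    than the group average get positive advantage, worse get negative.
    (Illustrative sketch of the general GRPO recipe.)"""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # all branches tied: no learning signal from this group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

These advantages then weight the policy-gradient update on each branch's tokens, so a correct early exit (high decayed reward) is reinforced relative to a long or incorrect rollout in the same group.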
Huawei cracks the AI reasoning "overthinking" problem! New method speeds up large-model reasoning by 60% while raising accuracy
QbitAI (量子位) · 2025-05-29 07:19