Core Insights
- The article introduces the Self-Aligned Reward (SAR) method, which aims to improve both reasoning efficiency and accuracy in large language models by addressing the "overthinking" phenomenon observed in existing reinforcement learning frameworks [3][25].

Group 1: Introduction of DeepSeek-R1 and RLVR
- On January 20, 2025, DeepSeek released the reasoning model DeepSeek-R1, sparking broad interest in reinforcement learning methods for large models [2].
- Researchers found that in tasks with verifiable answers, simple "correct/incorrect" feedback signals were enough for models to learn complex reasoning strategies, substantially improving their reasoning capabilities [2].

Group 2: Limitations of RLVR
- Despite its success, RLVR (reinforcement learning with verifiable rewards) suffers from the "overthinking" phenomenon: models generate unnecessarily long and repetitive reasoning chains even for simple questions [3].
- This reduces reasoning efficiency and raises inference cost, a critical challenge for current RLVR methods [3][4].

Group 3: Proposed Solutions and SAR
- Researchers trace the root cause of overthinking to the coarse-grained nature of RLVR's reward signal, which does not differentiate between intermediate reasoning steps [4].
- A common mitigation imposes explicit constraints on reasoning length, such as penalizing the total number of generated tokens, but this often sacrifices overall accuracy [5].
- To address these challenges, researchers from the University of Illinois Urbana-Champaign and Amazon AWS proposed the Self-Aligned Reward (SAR), which uses internal signals from the large language model itself to provide feedback on how useful a reasoning process actually is [6][25].
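The coarse RLVR signal and the length-penalty workaround described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the penalty coefficient `alpha` are assumptions.

```python
def rlvr_reward(predicted: str, reference: str) -> float:
    """Coarse RLVR-style signal: 1.0 if the final answer matches, else 0.0.

    Note how the reward ignores everything about the reasoning chain
    except the final answer -- the coarseness the article points to.
    """
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def length_penalized_reward(predicted: str, reference: str,
                            num_tokens: int, alpha: float = 0.001) -> float:
    """The common mitigation: subtract a penalty proportional to output
    length. `alpha` is a hypothetical coefficient for illustration; as the
    article notes, such hard penalties often trade away accuracy.
    """
    return rlvr_reward(predicted, reference) - alpha * num_tokens
```

A uniform per-token penalty discourages long outputs regardless of whether the extra tokens carried useful reasoning, which is exactly the failure mode that motivates a finer-grained, content-aware signal like SAR.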
Group 4: Characteristics of SAR
- SAR is continuous and fine-grained, enabling a nuanced assessment of output quality rather than binary feedback [9].
- It avoids introducing complex evaluation frameworks or a separate reward model, keeping implementation and training costs low [10].
- SAR engages directly with the semantic content of the reasoning process, so it accurately reflects the effectiveness and relevance of what the model actually wrote [10].

Group 5: Experimental Results
- Experiments across four base models and seven datasets show that SAR integrates seamlessly into mainstream reinforcement learning algorithms such as PPO and GRPO [18].
- Adding SAR improved average accuracy by roughly 4% and reduced output length by at least 30% compared with RLVR-only baselines [18][23].
- SAR performed stably and well across a variety of tasks, including logical reasoning, indicating strong cross-task generalization [18].

Group 6: Conclusion and Future Implications
- The study presents SAR as a simple yet effective remedy for the overthinking problem in reinforcement-learning-trained reasoning models, improving both accuracy and computational efficiency [25].
- SAR reflects a new direction in large-model reinforcement learning: transforming the model's internal information into continuous feedback signals for sustainable training [25].
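The summary says SAR is a continuous score that plugs into PPO/GRPO alongside the verifiable reward, but does not give the exact formulation. Under the hedged assumption that the continuous self-aligned score is simply blended with the binary correctness signal before a GRPO-style group normalization, the integration could look like this (the blend weight `beta` and both function names are hypothetical):

```python
from statistics import mean, pstdev


def blended_reward(correct: bool, self_aligned_score: float,
                   beta: float = 0.5) -> float:
    """Combine the binary verifiable reward with a continuous [0, 1]
    self-aligned score. `beta` is an assumed weighting, not the paper's.
    """
    return float(correct) + beta * self_aligned_score


def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: standardize rewards within one sampled
    group of responses to the same prompt (zero mean, unit std).
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        # All responses scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

Because the group normalization only needs a scalar reward per response, any continuous, fine-grained signal of this shape drops into an existing GRPO (or PPO) loop without a separate reward model, consistent with the low implementation cost the article describes.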
ICLR 2026 | UIUC: One line of code completely solves overthinking in LLM reasoning!
机器之心 (Synced) · 2026-02-08 03:10