Stop Large Models from Overthinking! Shanghai AI Lab's New Post-Training Paradigm Reshapes CoT for Reasoning That Is Both Fast and Good
量子位· 2025-12-21 02:00
Core Viewpoint
- The article introduces a new post-training paradigm called RePro (Rectifying Process-level Reward), which aims to improve the reasoning efficiency of large language models (LLMs) by curbing "overthinking" during inference [2][30].

Group 1: RePro Overview
- RePro views the reasoning process as an optimization of the model's internal state, offering a fresh perspective on reshaping the Chain-of-Thought (CoT) in large models [3].
- Its core idea is to treat the model's reasoning trajectory as a path toward the optimal solution on a loss surface [3].

Group 2: Correction Mechanisms
- RePro integrates a process reward mechanism directly into reinforcement learning with verifiable rewards (RLVR) pipelines such as PPO and GRPO [4].
- It defines a computable objective function J that quantifies the model's confidence in its current reasoning context; higher values indicate greater confidence in the correctness of the answer [5][6].

Group 3: Reasoning Quality Assessment
- RePro introduces a dual scoring mechanism that evaluates reasoning quality from the growth rate and smoothness of the objective function J [10].
- The Magnitude Score measures the improvement in the objective function, while the Stability Score assesses whether the reasoning process is smooth or filled with hesitation [11][13].

Group 4: Integration into RL Training
- RePro uses an entropy filtering strategy to cut computational cost: the reasoning chain is segmented into logical paragraphs, and only the top-k segments are selected for reward calculation [18][20].
- The process-level reward is derived from the improvement in the process score and combined with final-answer correctness to serve as the advantage-function input for reinforcement learning [21][22].
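The dual scoring and entropy-filtering steps described above can be sketched in code. This is a minimal illustration, not the paper's implementation: the function names, the exact score formulas, and the combination weights (`w_mag`, `w_stab`) are all assumptions made for exposition.

```python
# Illustrative sketch of RePro-style dual scoring and entropy filtering.
# All names, formulas, and weights below are assumptions for exposition;
# the paper's exact definitions may differ.
from typing import List, Sequence


def magnitude_score(J: Sequence[float]) -> float:
    """Net improvement of the objective J over a reasoning trajectory
    (assumed here to be final value minus initial value)."""
    return J[-1] - J[0]


def stability_score(J: Sequence[float]) -> float:
    """Smoothness proxy: fraction of steps where J does not drop.
    A hesitant, backtracking chain yields many negative steps and
    therefore a low score."""
    steps = [J[i + 1] - J[i] for i in range(len(J) - 1)]
    if not steps:
        return 1.0
    return sum(1 for d in steps if d >= 0) / len(steps)


def topk_segments(entropies: Sequence[float], k: int) -> List[int]:
    """Entropy filtering: keep only the k highest-entropy paragraph
    indices for reward computation, returned in original order."""
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i],
                    reverse=True)
    return sorted(ranked[:k])


def trajectory_advantage(J: Sequence[float], is_correct: bool,
                         w_mag: float = 0.5, w_stab: float = 0.5) -> float:
    """Combine the process-level score with final-answer correctness
    into a single advantage-style training signal (the additive form
    and the weights are assumptions)."""
    process = w_mag * magnitude_score(J) + w_stab * stability_score(J)
    return float(is_correct) + process
```

For example, a trajectory whose J values rise from 0.1 to 0.6 with one dip gets a positive magnitude score but a stability score below 1.0, so smooth, monotonic chains are rewarded over oscillating ones.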
Group 5: Experimental Results
- RePro has been tested across various tasks, showing stable accuracy improvements under different RL algorithms, including PPO and GRPO [23].
- The model demonstrated a significant reduction in the average number of tokens generated during reasoning, indicating a more efficient inference process [25][27].
- Instances of backtracking behavior during reasoning were significantly reduced, showing improved logical flow in the model's thought process [28].