MEITUAN - Tackling AI's Overthinking Problem! New Meituan Research Activates Efficient Reasoning in LRMs Through "Verifiable" Process Rewards

Core Insights
- The article discusses the introduction of a Verifiable Step Reward Mechanism (VSRM) aimed at addressing the issue of "overthinking" in models, particularly in mathematical reasoning tasks [6][12][23].

Group 1: Overthinking Problem
- The phenomenon of overthinking is characterized by models providing multiple answers to simple problems, leading to incorrect conclusions due to ineffective reasoning steps [10][11].
- A case study highlighted that models often oscillate between correct and incorrect answers, resulting in erroneous final conclusions [10][12].

Group 2: VSRM Introduction
- VSRM combines verifiable rewards with step-level rewards to encourage effective reasoning steps while penalizing ineffective ones, thus optimizing the reasoning process [12][20].
- The mechanism is designed to provide clear reward signals for each reasoning step, enhancing the model's ability to discern between effective and ineffective steps [20][23].

Group 3: Experimental Results
- Experiments on common benchmarks demonstrated that VSRM significantly reduces output length while maintaining model performance, achieving a balance between efficiency and effectiveness [21][23].
- Ablation studies confirmed the effectiveness of the forward-looking window mechanism within VSRM, showing that it helps maintain exploration capabilities without sacrificing performance [22][23].
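
The step-level reward idea summarized under Group 2, together with the forward-looking window mentioned under Group 3, can be illustrated with a minimal sketch. The summary does not give the actual reward formula, so everything below is an assumption made for illustration: that each step is scored by the change in a verifiable accuracy estimate it produces, that a step with no change borrows a discounted share of the nearest change inside a look-ahead window, and the names `step_rewards`, `window`, `gamma`, and `eps` are hypothetical. It should not be read as the authors' implementation.

```python
from typing import List


def step_rewards(acc: List[float], window: int = 3, gamma: float = 0.5,
                 eps: float = 1e-9) -> List[float]:
    """Assign one reward per reasoning step from verifiable accuracies.

    Assumed input: acc[0] is the accuracy estimate before any step and
    acc[i] is the estimate after step i (for example, the fraction of
    correct answers when the model is forced to answer at that point).
    """
    # Accuracy change caused by each step: positive helps, negative hurts.
    deltas = [acc[i] - acc[i - 1] for i in range(1, len(acc))]

    rewards = []
    for i, delta in enumerate(deltas):
        if abs(delta) > eps:
            # The step changed accuracy, so reward or penalize it directly.
            rewards.append(delta)
            continue

        # The step left accuracy unchanged. Look ahead up to `window` steps
        # for the nearest change and credit it back, discounted by distance.
        reward = 0.0
        for k in range(1, window + 1):
            j = i + k
            if j < len(deltas) and abs(deltas[j]) > eps:
                reward = (gamma ** k) * deltas[j]
                break
        rewards.append(reward)
    return rewards
```

Under these assumptions, a trajectory whose accuracy estimates run `[0.0, 0.0, 1.0, 1.0, 0.0, 0.0]` yields rewards `[0.5, 1.0, -0.5, -1.0, 0.0]`: the step that reaches the correct answer is rewarded, the later step that abandons it is penalized, and neutral steps inherit a discounted share of what follows. The look-ahead keeps a neutral step that precedes progress from being scored as if it were useless, which is one plausible reading of how the window preserves exploration.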