Verifiable Step Reward Mechanism (VSRM)
Tackling AI Overthinking! New Meituan Research Activates Efficient Reasoning in LRMs Through "Verifiable" Process Rewards
Sou Hu Cai Jing· 2025-09-11 22:23
Core Insights
- The article discusses the introduction of a Verifiable Step Reward Mechanism (VSRM) aimed at addressing the issue of "overthinking" in models, particularly in mathematical reasoning tasks [6][12][23].

Group 1: Overthinking Problem
- The phenomenon of overthinking is characterized by models providing multiple answers to simple problems, leading to incorrect conclusions due to ineffective reasoning steps [10][11].
- A case study highlighted that models often oscillate between correct and incorrect answers, resulting in erroneous final conclusions [10][12].

Group 2: VSRM Introduction
- VSRM combines verifiable rewards with step-level rewards to encourage effective reasoning steps while penalizing ineffective ones, thus optimizing the reasoning process [12][20].
- The mechanism is designed to provide clear reward signals for each reasoning step, enhancing the model's ability to discern between effective and ineffective steps [20][23].

Group 3: Experimental Results
- Experiments on common benchmarks demonstrated that VSRM significantly reduces output length while maintaining model performance, achieving a balance between efficiency and effectiveness [21][23].
- Ablation studies confirmed the effectiveness of the forward-looking window mechanism within VSRM, showing that it helps maintain exploration capabilities without sacrificing performance [22][23].
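The step-level reward idea summarized above can be illustrated with a minimal sketch. The article does not give the actual VSRM formula, so everything here is an assumption made for illustration: the function name `step_rewards`, the use of a per-prefix verified accuracy as the "verifiable" signal, the accuracy difference as the step reward, and the `window` and `discount` parameters standing in for the forward-looking window.

```python
from typing import List


def step_rewards(prefix_accuracy: List[float],
                 window: int = 3,
                 discount: float = 0.5,
                 eps: float = 1e-12) -> List[float]:
    """Hypothetical step-level reward, NOT the paper's exact formula.

    prefix_accuracy[i] is an assumed verifiable score (for example, the
    fraction of sampled completions that reach the correct answer) after
    the first i reasoning steps; index 0 is the score before any step.
    """
    # Reward signal of a step = change in verified accuracy it causes.
    deltas = [b - a for a, b in zip(prefix_accuracy, prefix_accuracy[1:])]

    rewards = []
    for i, delta in enumerate(deltas):
        if abs(delta) > eps:
            # The step changed accuracy: positive = effective, negative = harmful.
            rewards.append(delta)
            continue
        # The step changed nothing: look ahead up to `window` steps and
        # borrow the first non-zero change, discounted by its distance.
        reward = 0.0
        for k in range(1, window + 1):
            j = i + k
            if j >= len(deltas):
                break
            if abs(deltas[j]) > eps:
                reward = deltas[j] * discount ** k
                break
        rewards.append(reward)
    return rewards


if __name__ == "__main__":
    # Accuracy rises at step 2, then drops at step 4 (made-up numbers).
    print(step_rewards([0.0, 0.0, 0.5, 0.5, 0.25]))
```

In this sketch a step that leaves accuracy unchanged still receives a discounted share of the next change inside the window, so a step that prepares a later improvement is not treated the same as one that precedes a drop. Setting `window=0` reduces the reward to the plain accuracy difference.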
量子位 (QbitAI) · 2025-09-11 10:19
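Contributed by the Meituan Search & Recommendation Agentic System X (AsX) team
量子位 | WeChat official account QbitAI

Through the simple yet effective RLVR (reinforcement learning with verifiable rewards) paradigm, large reasoning models (LRMs) have developed strong chain-of-thought (CoT) reasoning capabilities. The lengthy outputs that come with this, however, not only raise inference cost significantly but also reduce serving throughput. This phenomenon, which wears down users' patience, is known as the "overthinking" problem.

To address this flaw, a research team from Meituan and other institutions proposes the Verifiable Step Reward Mechanism (VSRM), which encourages "effective steps" in the CoT and penalizes "ineffective steps", achieving efficient reasoning while preserving performance as far as possible.

Experiments on mathematical tasks show that, across several commonly used benchmarks, post-training with VSRM substantially shortens the outputs of models at different scales, and in some cases even improves model performance.

The nature of the overthinking problem

Previous work characterized overthinking as follows: for a given problem, a model tends to produce several different solutions, especially when the problem is simple. Building on this understanding, the authors went a step further and conducted an in-depth case study of the responses that existing LRMs produce on MATH-500.

| Find the number of integer values of k in the closed interval [-500,500] for whic ...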