Verifiable Process Rewards



Sou Hu Cai Jing· 2025-09-11 22:23

Core Insights
- The article discusses the introduction of a Verifiable Step Reward Mechanism (VSRM) aimed at addressing the issue of "overthinking" in models, particularly in mathematical reasoning tasks [6][12][23].

Group 1: Overthinking Problem
- The phenomenon of overthinking is characterized by models providing multiple answers to simple problems, leading to incorrect conclusions due to ineffective reasoning steps [10][11].
- A case study highlighted that models often oscillate between correct and incorrect answers, resulting in erroneous final conclusions [10][12].

Group 2: VSRM Introduction
- VSRM combines verifiable rewards with step-level rewards to encourage effective reasoning steps while penalizing ineffective ones, thus optimizing the reasoning process [12][20].
- The mechanism is designed to provide clear reward signals for each reasoning step, enhancing the model's ability to discern between effective and ineffective steps [20][23].

Group 3: Experimental Results
- Experiments on common benchmarks demonstrated that VSRM significantly reduces output length while maintaining model performance, achieving a balance between efficiency and effectiveness [21][23].
- Ablation studies confirmed the effectiveness of the forward-looking window mechanism within VSRM, showing that it helps maintain exploration capabilities without sacrificing performance [22][23].
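The summary above does not spell out how step-level rewards or the forward-looking window are computed, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that a verifiable accuracy estimate is available after each reasoning step, that a step's reward is the change in that accuracy, and that a step with no immediate change borrows a discounted share of the first non-zero change found within a look-ahead window. The function name `step_rewards` and the parameters `window`, `discount`, and `eps` are hypothetical.

```python
from typing import List


def step_rewards(
    accuracies: List[float],
    window: int = 3,
    discount: float = 0.5,
    eps: float = 1e-9,
) -> List[float]:
    """Toy step-level reward with a forward-looking window (assumed design).

    accuracies[0] is the estimated accuracy before any step;
    accuracies[i] is the estimated accuracy after step i.
    Returns one reward per step.
    """
    # Immediate effect of each step: change in verifiable accuracy.
    diffs = [accuracies[i + 1] - accuracies[i] for i in range(len(accuracies) - 1)]

    rewards = []
    for i, d in enumerate(diffs):
        if abs(d) > eps:
            # The step changed accuracy: reward (or penalize) it directly.
            rewards.append(d)
            continue
        # No immediate change: look ahead up to `window` steps and take a
        # discounted share of the first non-zero change, if there is one.
        r = 0.0
        for k in range(1, window + 1):
            j = i + k
            if j >= len(diffs):
                break
            if abs(diffs[j]) > eps:
                r = (discount ** k) * diffs[j]
                break
        rewards.append(r)
    return rewards


if __name__ == "__main__":
    # Accuracy rises at step 3, then drops at step 5 (the model talks itself
    # out of a correct answer, as in the oscillation described above).
    print(step_rewards([0.0, 0.0, 0.0, 1.0, 1.0, 0.5], window=2, discount=0.5))
```

In this toy setup, steps that lead toward a later accuracy gain receive a small positive reward, while steps that precede a drop are penalized, which is one way to read "encourage effective steps, penalize ineffective ones".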

Tackling AI Overthinking! New Meituan Research Uses "Verifiable" Process Rewards to Unlock Efficient Reasoning in LRMs

量子位 (QbitAI) · 2025-09-11 10:19

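Contributed to QbitAI by the Agentic System X (AsX) team at Meituan Search and Recommendation | WeChat official account: QbitAI

Through the simple yet effective RLVR (reinforcement learning with verifiable rewards) paradigm, large reasoning models (LRMs) have developed strong chain-of-thought (CoT) reasoning ability. The verbose outputs that come with it, however, significantly increase inference cost and reduce serving throughput. This patience-draining behavior is known as the "overthinking" problem.

To address this shortcoming, a research team from Meituan and other institutions proposes a verifiable process reward mechanism (VSRM) that encourages "effective steps" in the CoT and penalizes "ineffective steps", achieving efficient reasoning while preserving performance as far as possible. Experiments on mathematical tasks show that, across several commonly used benchmarks, post-training with VSRM substantially shortens the outputs of models at different scales, and in some cases even improves model performance.

The nature of the overthinking problem

Previous work characterized overthinking as follows: for a given problem, and especially a simple one, the model tends to produce several different solutions. Building on this understanding, the authors go a step further and conduct an in-depth case study of the responses that existing LRMs produce on MATH-500.

| Find the number of integer values of k in the closed interval [-500,500] for whic ...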