Reward Model

Reward models finally enter a new pre-training era! POLAR from Shanghai AI Lab and Fudan University opens a new scaling paradigm
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article discusses the limitations of current reward modeling methods in reinforcement learning, particularly in the context of large language models (LLMs), and introduces a new paradigm called POLAR that aims to enhance scalability and generalization in reward modeling [2][3][5].

Group 1: Current Reward Modeling Methods
- Preference-based reward modeling relies on high-quality preference data, which is costly and difficult to scale, and it struggles with generalization and is susceptible to reward hacking [3][4].
- Rule-based verifier methods provide accurate reward signals for verifiable tasks but do not extend to more general scenarios such as open-domain dialogue and complex interactions [3][4].

Group 2: Introduction of POLAR
- POLAR, developed by a team from Shanghai AI Lab and Fudan University, uses Policy Discriminative Learning to decouple reward modeling from absolute preferences, allowing efficient scaling and strong generalization [5][9].
- POLAR's training measures the "distance" between a candidate policy and an optimal policy, providing a relative reward signal that does not depend on human-annotated preferences [9][10].

Group 3: Training Methodology
- POLAR's pre-training corpus is constructed through automated data synthesis, sampling prompts from LLM pre-training data and drawing trajectories from a large pool of models [14][15].
- The pre-training objective uses a Bradley-Terry loss to assign higher rewards to trajectories generated by similar policies, effectively modeling the differences between policy distributions (see the sketch after this summary) [14][15].

Group 4: Performance and Generalization
- POLAR demonstrates superior performance in preference evaluation, outperforming state-of-the-art reward models by significant margins across a range of tasks, including STEM [33].
- In reinforcement fine-tuning (RFT) experiments, models fine-tuned with POLAR show an average improvement of 9.0% over their initial results, highlighting its effectiveness in enhancing LLM capabilities [34].

Group 5: Scaling Effects
- POLAR exhibits scaling laws similar to LLM next-token prediction: increased computational resources lead to improved reward model performance [35].
- Validation loss decreases as a power law in model parameters and training compute, suggesting the potential for building more powerful and generalizable reward models [35].

Conclusion
- POLAR represents a novel and scalable approach to reward modeling, offering new possibilities for LLM post-training and addressing key challenges in reinforcement learning [37].
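The Bradley-Terry objective mentioned in Group 3 can be illustrated with a short sketch. This is not the official POLAR implementation; the pair scorer, trajectory embeddings, and function names below are hypothetical stand-ins, meant only to show how a reward model can be trained to score trajectories from the same policy as a reference higher than trajectories from a different policy.

```python
# Minimal sketch (assumptions, not POLAR's code) of a Bradley-Terry style
# policy-discriminative objective: prefer same-policy trajectories over
# different-policy trajectories, relative to a reference trajectory.
import torch
import torch.nn.functional as F


def policy_discriminative_bt_loss(r_same: torch.Tensor, r_diff: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: maximize P(same-policy preferred) = sigmoid(r_same - r_diff)."""
    return -F.logsigmoid(r_same - r_diff).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    B, d = 4, 16
    scorer = torch.nn.Bilinear(d, d, 1)      # stand-in for a (reference, candidate) trajectory scorer
    ref = torch.randn(B, d)                  # reference trajectory embeddings
    same = ref + 0.1 * torch.randn(B, d)     # trajectories from a similar policy
    diff = torch.randn(B, d)                 # trajectories from an unrelated policy

    loss = policy_discriminative_bt_loss(
        scorer(ref, same).squeeze(-1),
        scorer(ref, diff).squeeze(-1),
    )
    loss.backward()
    print(f"BT loss: {loss.item():.4f}")
```

In this toy setup the scorer plays the role of the reward model; in the article's description the reward signal is relative (distance between candidate and reference policy) rather than an absolute preference label.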
PKU and Tencent break through the reward model bottleneck! Letting AI understand human preferences, with generalization on par with GPT-4.1
量子位· 2025-06-26 02:11
RA team, via 量子位 | 公众号 QbitAI

Always "rote memorization," "knowing the what but not the why"? Reward model training has likewise fallen into a student-picks-the-standard-answer pattern, trapped by spurious rules such as "longer answer = better answer" and "nice formatting = good answer."

RewardAnything, proposed by Peking University's Knowledge Computing Lab together with Tencent WeChat's Pattern Recognition Center, William & Mary, Westlake University, and other institutions, breaks through this bottleneck: by letting the reward model directly understand evaluation principles described in natural language, it moves the paradigm from "rote memorization" to genuine understanding.

RewardAnything lowers the high cost of the traditional pipeline, which must collect preference data and train a reward model for each new scenario before running RL, and can use natural language directly as the standard for RLHF (a hedged sketch of this idea follows below).

As a reward model, it needs only a one-sentence principle to set a new SOTA on traditional benchmarks, and on RABench it shows principle-following ability and generalization comparable to top models such as GPT-4.1.

[Truncated results table: models compared by domain, principle category, and overall score ...]
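The "natural language as the RLHF standard" idea can be sketched as a principle-conditioned judging prompt. This is not the RewardAnything API; the prompt template, principle text, and helper function below are hypothetical illustrations of conditioning a reward score on a stated principle rather than on a fixed preference dataset.

```python
# Hypothetical sketch: build a principle-conditioned judging prompt that any
# LLM scorer could consume. Template and names are assumptions, not the
# RewardAnything implementation.

PRINCIPLE = "Prefer answers that are factually correct and concise; ignore formatting and length."

PROMPT_TEMPLATE = """You are a reward model. Judge the response strictly by this principle:
Principle: {principle}

Question: {question}
Response: {response}

Return a single score from 1 (violates the principle) to 10 (fully follows it)."""


def build_reward_prompt(question: str, response: str, principle: str = PRINCIPLE) -> str:
    """Assemble the judging prompt; the downstream scoring model is left unspecified."""
    return PROMPT_TEMPLATE.format(principle=principle, question=question, response=response)


if __name__ == "__main__":
    prompt = build_reward_prompt(
        question="What is the capital of France?",
        response="Paris. It has been the capital since the late 10th century.",
    )
    print(prompt)  # feed this to a scoring model of choice
```

The point of the sketch is the interface: swapping the principle string changes the reward criterion without collecting new preference data or retraining a scenario-specific reward model.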