Reward Models Finally Enter a New Era of Pre-training! Shanghai AI Lab and Fudan's POLAR Opens a New Scaling Paradigm

Core Viewpoint
- The article discusses the limitations of current reward modeling methods in reinforcement learning for large language models (LLMs) and introduces POLAR, a new paradigm that aims to make reward modeling more scalable and better at generalizing [2][3][5].

Group 1: Current Reward Modeling Methods
- Preference-based reward modeling relies on high-quality preference data, which is costly and hard to scale; it also generalizes poorly and is susceptible to reward hacking [3][4].
- Rule-based verifier methods provide accurate reward signals for verifiable tasks but do not extend to more general scenarios such as open-domain dialogue and complex interactions [3][4].

Group 2: Introduction of POLAR
- POLAR, developed by a team from Shanghai AI Lab and Fudan University, uses Policy Discriminative Learning to decouple reward modeling from absolute preferences, allowing efficient scaling and strong generalization [5][9].
- POLAR is trained to measure the "distance" between a candidate policy and the optimal policy, which yields a relative reward signal that does not depend on human-annotated preferences [9][10].

Group 3: Training Methodology
- POLAR's pre-training corpus is constructed through automated data synthesis: prompts are sampled from LLM pre-training data, and a large pool of models is used for trajectory sampling [14][15].
- The pre-training objective uses a Bradley-Terry loss to assign higher rewards to trajectories generated by similar policies, effectively modeling the differences between policy distributions [14][15].

Group 4: Performance and Generalization
- In preference evaluation, POLAR outperforms state-of-the-art reward models by significant margins across various tasks, including STEM [33].
- In reinforcement fine-tuning (RFT) experiments, models fine-tuned with POLAR show an average improvement of 9.0% over their initial results, highlighting its effectiveness in enhancing LLM capabilities [34].

Group 5: Scaling Effects
- POLAR exhibits scaling laws similar to those of next-token prediction in LLMs: more compute leads to better reward model performance [35].
- Validation loss decreases as a power law in both model parameters and training compute, suggesting the potential for building more powerful and more general reward models [35].

Conclusion
- POLAR is a novel and scalable approach to reward modeling that opens new possibilities for LLM post-training and addresses key challenges in reinforcement learning [37].
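
To make the pre-training objective in Group 3 concrete, here is a minimal sketch of a Bradley-Terry loss over pairs of reward scores. It is an illustration under stated assumptions, not POLAR's actual implementation: the function name `bradley_terry_loss`, the way pairs are formed (one trajectory from a policy similar to the reference, one from a different policy), and the example scores are all hypothetical.

```python
import numpy as np


def bradley_terry_loss(reward_similar, reward_different):
    """Mean Bradley-Terry loss: -log(sigmoid(r_similar - r_different)).

    reward_similar:   scores for trajectories from a policy similar to the
                      reference (assumed to be the side that should score higher).
    reward_different: scores for trajectories from a different policy.
    """
    margin = (np.asarray(reward_similar, dtype=float)
              - np.asarray(reward_different, dtype=float))
    # -log(sigmoid(m)) == log(1 + exp(-m)); logaddexp keeps this numerically stable.
    return float(np.mean(np.logaddexp(0.0, -margin)))


# Hypothetical scores, made up for illustration only.
well_ordered = bradley_terry_loss([2.0, 1.5], [0.0, -0.5])
mis_ordered = bradley_terry_loss([0.0, -0.5], [2.0, 1.5])
print(well_ordered, mis_ordered)
```

The loss shrinks as the similar-policy trajectory is scored further above the different-policy one, which is the ordering the article says the objective rewards.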
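
The power-law relationship described in Group 5 can be checked by fitting a straight line in log-log space. The sketch below assumes the simple form loss ≈ a · x^(−b), which may differ from the exact functional form the authors fit; the function name `fit_power_law` and the data points are synthetic and chosen only for illustration, not taken from the POLAR experiments.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss ~= a * x**(-b) via least squares on log(loss) vs log(x).

    Returns (a, b). Assumes all inputs are strictly positive.
    """
    x = np.asarray(x, dtype=float)
    loss = np.asarray(loss, dtype=float)
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic data generated from a known power law (not real measurements).
compute = np.array([1e18, 1e19, 1e20, 1e21])
validation_loss = 2.0 * compute ** (-0.05)

a, b = fit_power_law(compute, validation_loss)
print(a, b)
```

On real measurements the points will not lie exactly on a line, so the fitted exponent `b` should be read as a summary of the trend, not as an exact law.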