Reward Models Can Scale Too! Shanghai AI Lab Tackles a Reinforcement Learning Weak Spot with a New Paradigm: Policy Discriminative Learning
量子位· 2025-07-11 04:00
Core Viewpoint
- The article introduces a new reward modeling paradigm called Policy Discriminative Learning (POLAR), which strengthens the post-training phase of large language models (LLMs) and addresses the limitations of traditional reward models in reinforcement learning [1][3][4].

Group 1: Challenges in Reward Modeling
- The design and training of reward models have been a bottleneck in improving post-training effectiveness and overall model capability [2].
- Traditional reward models lack systematic pre-training and scaling methods, which prevents them from improving as computational resources grow [2].

Group 2: Introduction of POLAR
- POLAR decouples reward modeling from absolute preference judgments, allowing it to scale efficiently and to adapt to diverse customized needs via reference answers [3][5].
- POLAR can assign different scores to the same model output depending on the reference style provided, without retraining the reward model (a hedged scoring sketch appears after this summary) [7].

Group 3: Training Methodology of POLAR
- POLAR is trained in two stages, pre-training followed by preference fine-tuning, and uses a contrastive learning objective to measure the distance between the policy being trained and the target policy (see the sketch after this summary) [21][22].
- The pre-training stage relies on large volumes of automatically synthesized data, which makes it highly scalable [22][23].

Group 4: Performance and Scaling Effects
- POLAR exhibits clear scaling behavior: validation loss decreases as a power law as model parameters and compute increase (a generic form of this relation is given after this summary) [28][29].
- In preference evaluation experiments, POLAR outperforms state-of-the-art reward models, with notable gains across tasks and especially on STEM-related tasks [32][34].
- Because POLAR learns subtle distinctions between policy models, its reward signals generalize better in real-world reinforcement learning applications [35].
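To make the reference-conditioned scoring in Group 2 concrete, below is a minimal toy sketch of how such a reward model could be queried. The `ReferenceConditionedRM` class, its `score` method, and the string-similarity heuristic are illustrative assumptions, not the released POLAR model or API; the only point is that the same candidate answer receives different scores under different reference styles.

```python
# Hypothetical interface: a reference-conditioned reward model scores a candidate
# answer by how closely it matches a given reference answer. The similarity
# heuristic below is a toy stand-in for POLAR's learned scoring function.

from difflib import SequenceMatcher


class ReferenceConditionedRM:
    """Toy stand-in for a reference-conditioned reward model (hypothetical)."""

    def score(self, prompt: str, candidate: str, reference: str) -> float:
        # A real model would run a learned scorer; string similarity is used
        # here purely to show that the score depends on the reference.
        return SequenceMatcher(None, candidate, reference).ratio()


rm = ReferenceConditionedRM()
prompt = "Explain gravity to a child."
candidate = "Gravity is the force that pulls things toward the ground."

formal_ref = "Gravity is the attractive force between masses, described by Newton's law."
playful_ref = "Gravity is like an invisible hug from the Earth that pulls things down."

# The same candidate gets different scores under different reference styles,
# so switching preference criteria needs no retraining of the reward model.
print(rm.score(prompt, candidate, formal_ref))
print(rm.score(prompt, candidate, playful_ref))
```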
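The contrastive pre-training described in Group 3 pulls together responses sampled from the same policy and pushes apart responses from different policies. The sketch below is a minimal InfoNCE-style illustration of that idea under stated assumptions: the embedding source, batch construction, and temperature are placeholders, not the published POLAR training code.

```python
# Minimal sketch (assumed, not the released POLAR code) of a contrastive
# objective for policy discrimination: two responses from the same policy
# form a positive pair; responses from other policies in the batch are negatives.

import torch
import torch.nn.functional as F


def policy_contrastive_loss(anchor_emb: torch.Tensor,
                            positive_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over (batch, dim) response embeddings.

    Row i of anchor_emb and positive_emb are two responses produced by the
    same policy; off-diagonal rows come from different policies and act as
    negatives.
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)                # diagonal entries are positives


# Toy usage with random embeddings standing in for an LLM-based encoder.
batch, dim = 8, 128
loss = policy_contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```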
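The scaling behavior reported in Group 4 can be written in the generic power-law form commonly used in scaling-law studies; the symbols below are placeholders, since the article does not report the fitted constants:

```latex
L(C) \approx \left(\frac{C_0}{C}\right)^{\alpha} + L_\infty
```

Here C is training compute (an analogous relation is claimed for parameter count), \alpha > 0 is the fitted exponent, and L_\infty is the irreducible validation loss.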