Reward Models

Reward Models Can Scale Too! Shanghai AI Lab Tackles a Reinforcement Learning Bottleneck with a New Policy Discriminative Learning Paradigm
量子位· 2025-07-11 04:00
Core Viewpoint
- The article introduces a new reward modeling paradigm called Policy Discriminative Learning (POLAR), which strengthens the post-training phase of large language models (LLMs) and addresses the limitations of traditional reward models in reinforcement learning [1][3][4].

Group 1: Challenges in Reward Modeling
- The design and training of reward models have been a bottleneck for improving post-training effectiveness and overall model capability [2].
- Traditional reward models lack systematic pre-training and scaling methods, preventing them from improving as computational resources grow [2].

Group 2: Introduction of POLAR
- POLAR decouples reward modeling from absolute preference labels, allowing reward models to scale efficiently and adapt to varied customization needs based on reference answers [3][5].
- POLAR can assign different scores to model outputs depending on the reference style, without retraining the reward model [7].

Group 3: Training Methodology of POLAR
- POLAR uses a two-stage training process, pre-training followed by preference fine-tuning, with a contrastive learning objective that measures the distance between the policy being trained and the target policy (a minimal sketch follows this summary) [21][22].
- The pre-training phase relies on large amounts of automatically synthesized data, which makes it highly scalable [22][23].

Group 4: Performance and Scaling Effects
- POLAR exhibits clear scaling behavior: validation loss decreases as a power law as model parameters and compute increase [28][29].
- In preference evaluation experiments, POLAR outperforms state-of-the-art reward models, with notable gains across tasks and especially on STEM-related tasks [32][34].
- POLAR's ability to learn subtle distinctions between policy models improves the generalization of reward signals in real-world applications [35].
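The contrastive objective summarized in Group 3 can be illustrated with a minimal sketch. This is not the released POLAR implementation; the pairing scheme, the InfoNCE-style loss, and the function name `polar_style_contrastive_loss` are assumptions used only to show how a reward model could be trained to judge whether a candidate response was produced by the same policy as a reference response.

```python
# Minimal sketch (not the official POLAR code) of a contrastive pre-training
# objective for policy-discriminative reward modeling. The reward model is
# assumed to output a scalar score for each (reference response, candidate
# response) pair; training pushes pairs drawn from the SAME policy above
# pairs whose candidate comes from a DIFFERENT policy.
import torch
import torch.nn.functional as F

def polar_style_contrastive_loss(scores_same_policy: torch.Tensor,
                                 scores_other_policies: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss over reward-model scores.

    scores_same_policy:    (batch,)   scores for pairs from the same policy.
    scores_other_policies: (batch, K) scores for pairs whose candidate comes
                                      from K different policies (negatives).
    """
    # Put the positive score at column 0, followed by the K negatives.
    logits = torch.cat(
        [scores_same_policy.unsqueeze(1), scores_other_policies], dim=1
    ) / temperature
    # The positive pair sits at index 0 of every row.
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Example with random stand-in scores: 8 prompts, 15 negative policies each.
pos = torch.randn(8)
neg = torch.randn(8, 15)
print(polar_style_contrastive_loss(pos, neg))
```

The power-law observation in Group 4 corresponds to the usual scaling-law form, roughly validation loss L(C) ≈ a · C^(-b) in compute C; the article reports the trend, but the fitted constants are not given in the summary above.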
Human-AI Collaboration Filters 26 Million Data Entries, SOTA on All Seven Benchmarks: Kunlun Wanwei's Open-Source Reward Model Achieves Another Breakthrough
机器之心· 2025-07-04 02:36
Core Viewpoint
- The article covers the Skywork-Reward-V2 series of reward models developed by Kunlun Wanwei, highlighting their leading performance across benchmarks and the innovative data construction methods behind them [4][5][8].

Group 1: Reward Model Significance
- Reinforcement Learning from Human Feedback (RLHF) is central to aligning large language models (LLMs) with human values, and the Reward Model (RM) serves as the key evaluator in that process [2][3].
- A reward model's effectiveness depends on how accurately it assesses outputs, how well it generalizes across knowledge domains, and how flexibly it handles diverse inputs [3].

Group 2: Skywork-Reward-V2 Series
- The Skywork-Reward-V2 series includes eight models ranging from 0.6 billion to 8 billion parameters, achieving top rankings across seven major reward model evaluation benchmarks [5][7].
- The models show broad applicability, excelling along dimensions such as alignment with human preferences, objective correctness, safety, and resistance to style bias [7].

Group 3: Data Construction Innovations
- Kunlun Wanwei built Skywork-SynPref-40M, the largest mixed preference dataset to date with 40 million preference pairs, using a two-phase iterative data selection pipeline [17][20].
- The first phase relies on human-guided construction of high-quality preferences; the second phase uses the trained reward model to automate large-scale expansion of the preference data (a sketch of the standard pairwise training objective on such pairs follows this summary) [20][22].

Group 4: Performance Metrics
- The Skywork-Reward-V2 models set new records across benchmarks, with the smallest model (Skywork-Reward-V2-Qwen3-0.6B) substantially narrowing the gap with much larger models [31].
- The largest models, Skywork-Reward-V2-Llama-3.1-8B and Skywork-Reward-V2-Llama-3.1-8B-40M, outperform leading closed-source models on all major benchmark tests [32].

Group 5: Future Implications
- The Skywork-Reward-V2 results suggest a shift toward data-driven alignment techniques in RLHF, potentially leading to further evolution in the field [45][46].
- Combining human and AI-driven data annotation is expected to improve the scalability and quality of preference data, and in turn the performance of large models [46][47].
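The preference pairs described in Group 3 are typically consumed with the standard pairwise (Bradley-Terry) reward-model objective. Whether Skywork-Reward-V2 uses exactly this loss is not stated in the summary above; the snippet below is a generic sketch of how a reward model is fine-tuned on (chosen, rejected) pairs such as those in Skywork-SynPref-40M.

```python
# Standard pairwise (Bradley-Terry) objective for reward-model training on
# preference pairs: maximize the margin between the score of the preferred
# (chosen) response and the rejected one, via -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example: scalar rewards the model assigned to 4 chosen/rejected responses.
r_chosen = torch.tensor([1.2, 0.4, 2.0, -0.1])
r_rejected = torch.tensor([0.3, 0.5, 1.1, -0.8])
print(pairwise_preference_loss(r_chosen, r_rejected))
```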
Why Hasn't DeepSeek-R2 Been Released Yet?
量子位· 2025-06-27 08:09
Core Viewpoint
- The release of DeepSeek-R2 has been delayed, reportedly because CEO Liang Wenfeng is not satisfied with its performance and because of a shortage of the Nvidia H20 chips critical to its development [1][2][4].

Development Timeline
- Anticipation for R2 began after the release of the DeepSeek-V3 model last December, which was regarded as a benchmark for cost-effectiveness [5].
- An upgraded V3 was released in March 2025, fueling speculation that R2 would follow in April [11].
- Although a paper on scaling laws appeared in early April, there has been no official update on R2 since then [12][16].

Technical Specifications
- R1's training reportedly used 30,000 H20 chips, 10,000 H800 chips, and 10,000 H100 chips, indicating the scale of computational resources R2 would require [3].
- Leaked parameters for R2 claimed 1.2 trillion parameters and 5.2 PB of training data, though the authenticity of these figures remains unverified [17].

Community Reactions
- Following news of the delay, community responses varied: some felt the delay is justified, while others speculated that R2 might wait for the release of V4 [26][30].