Multimodal Reward Models
RL Training Keeps Collapsing? R1-Reward Stably Unlocks Long-CoT Reasoning for Reward Models
机器之心· 2025-05-12 04:31
Core Viewpoint
- The article discusses the development and application of the R1-Reward model, which uses a new algorithm called StableReinforce to improve multimodal reward models (MRMs) through reinforcement learning (RL), addressing training instability and inconsistency in reward modeling [1][38].

Group 1: R1-Reward Model and Its Applications
- R1-Reward has shown significant academic value and has been applied in production scenarios at Kuaishou, including short video, e-commerce, and live streaming, with notable performance gains [2].
- The R1-Reward model outperforms state-of-the-art (SOTA) models by 5%-15% on existing multimodal reward model benchmarks, and its performance improves further as the number of inference samples increases [1][38].

Group 2: Algorithm Improvements
- The article introduces a new algorithm, StableReinforce, which refines existing RL methods to improve training stability and efficiency [9].
- Key improvements include a progressive training strategy, a robust advantage-handling step called the Advantage Filter, and a novel "consistency reward" mechanism that checks whether the model's analysis is coherent with its final answer [12][25].

Group 3: Training Methodology
- Training proceeds in two steps: a supervised fine-tuning (SFT) phase on a dataset of 200,000 preference data points, followed by a reinforcement learning phase that focuses on more challenging samples [27][30].
- The SFT phase teaches the model the task format and reasoning process, while the RL phase targets samples deemed "harder" to sharpen the model's ability to discern subtle differences [32].

Group 4: Experimental Results
- R1-Reward delivers strong results on multiple multimodal reward model leaderboards, significantly surpassing previous best models [34].
- A voting strategy at evaluation time, where the model outputs multiple judgments and the most frequent answer is selected, yields substantial accuracy gains: accuracy rises from approximately 71% to 86.47% when voting 15 times (a minimal voting sketch follows this summary) [35].

Group 5: Future Directions
- The article suggests there are many unexplored avenues for applying RL to reward modeling, including more advanced voting strategies and improved training methods to further strengthen the model's foundational capabilities [38].
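The voting procedure described under Group 4 is simple to reproduce. Below is a minimal sketch in Python, assuming a hypothetical `judge_pair` callable that queries the reward model once with sampling enabled and returns its preferred answer ("A" or "B"); the function name, interface, and default of 15 votes are illustrative assumptions, not details taken from the article.

```python
from collections import Counter

def vote_judgment(judge_pair, prompt, response_a, response_b, k=15):
    """Majority voting over k sampled judgments (Group 4).

    `judge_pair` is a hypothetical callable that samples one judgment
    from the reward model per call, e.g. returning "A" or "B".
    The article reports accuracy rising from roughly 71% (single
    judgment) to 86.47% when voting 15 times.
    """
    votes = [judge_pair(prompt, response_a, response_b) for _ in range(k)]
    winner, _ = Counter(votes).most_common(1)[0]
    return winner, votes
```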
Breaking the Multimodal Reward Bottleneck! Chinese Academy of Sciences, Tsinghua, and Kuaishou Jointly Propose R1-Reward, Giving Models Long-Term Reasoning Ability via Reinforcement Learning
量子位· 2025-05-08 06:58
Core Viewpoint
- The article discusses the development of the R1-Reward model, which uses a stable reinforcement learning algorithm (StableReinforce) to improve multimodal reward models (MRMs) for multimodal large language models (MLLMs) [1][45].

Group 1: Model Development and Performance
- The R1-Reward model achieves a performance improvement of 5%-15% over current state-of-the-art (SOTA) models on existing multimodal reward model benchmarks [2].
- Performance increases further with more inference-time sampling, indicating substantial room for optimization through reinforcement learning [3].
- R1-Reward achieves outstanding results on several mainstream multimodal reward model benchmarks, significantly surpassing previous best models, with improvements of 8.4% and 14.3% on different leaderboards [11][38].

Group 2: Key Contributions and Innovations
- The model provides stable rewards during training, can select better samples during evaluation, and can also serve as an evaluator on its own [4].
- A "consistency reward" mechanism ensures that the model's analysis aligns with its final answer, encouraging logically grounded judgments [11][31].
- The research team collected 200,000 preference data points to build the R1-Reward-200k dataset for training and employed a progressive-difficulty training strategy to improve learning [11][34].

Group 3: Algorithm Enhancements
- The StableReinforce algorithm addresses the limitations of existing reinforcement learning methods by introducing improvements such as Pre-Clip and the Advantage Filter to stabilize training and improve performance (a sketch of both steps follows this summary) [9][26].
- The Pre-Clip strategy mitigates the impact of large ratio differences during probability calculations, while the Advantage Filter retains only samples within a specified range so that extreme values do not destabilize training [23][26].
- The model's average output length decreased by roughly 15% after reinforcement learning training, suggesting improved efficiency [44].

Group 4: Future Directions
- The article highlights further room to explore reinforcement learning for reward modeling, including more advanced voting strategies at inference time and improved training methods to strengthen the model's foundational capabilities [45].
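As a rough illustration of the two stabilization steps named in Group 3, the sketch below pre-clips the log-probability ratio before exponentiation and masks out samples whose standardized advantage falls outside a fixed window. The specific bounds (a log-ratio limit of 3.0 and a ±3-sigma window), the tensor shapes, and the PyTorch framing are assumptions for illustration, not the paper's exact settings.

```python
import torch

def preclip_ratio(logp_new, logp_old, log_ratio_bound=3.0):
    """Pre-Clip (assumed form): bound the log-probability ratio before
    exponentiating, so exp() cannot blow up when the new and old
    policies diverge. The bound of 3.0 is an illustrative choice."""
    log_ratio = torch.clamp(logp_new - logp_old, -log_ratio_bound, log_ratio_bound)
    return torch.exp(log_ratio)

def advantage_filter(advantages, sigma_range=3.0):
    """Advantage Filter (assumed form): keep only samples whose
    standardized advantage lies within +/- sigma_range standard
    deviations, masking out extreme values that could otherwise
    destabilize the policy update."""
    mean, std = advantages.mean(), advantages.std().clamp_min(1e-8)
    z = (advantages - mean) / std
    mask = (z.abs() <= sigma_range).float()
    return advantages * mask, mask

# Illustrative use inside a PPO-style update (all names hypothetical):
# ratio = preclip_ratio(logp_new, logp_old)
# adv, mask = advantage_filter(raw_advantages)
# loss = -(torch.min(ratio * adv,
#                    torch.clamp(ratio, 0.8, 1.2) * adv) * mask).sum() / mask.sum().clamp_min(1.0)
```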