Breaking the multi-modal reward bottleneck! CAS, Tsinghua, and Kuaishou jointly propose R1-Reward, using reinforcement learning to give models long-term reasoning ability
QbitAI · 2025-05-08 06:58

Core Viewpoint
- The article covers the development of the R1-Reward model, which uses a stable reinforcement learning algorithm (StableReinforce) to improve the performance of multi-modal reward models (MRMs) for multi-modal large language models (MLLMs) [1][45].

Group 1: Model Development and Performance
- R1-Reward improves on current state-of-the-art (SOTA) models by 5%-15% on existing multi-modal reward model benchmarks [2].
- Performance rises further with additional inference-time sampling, indicating substantial remaining headroom for optimization through reinforcement learning (see the voting sketch after this summary) [3].
- R1-Reward posts strong results on several mainstream multi-modal reward model benchmarks, significantly surpassing the previous best models, with gains of 8.4% and 14.3% on different leaderboards [11][38].

Group 2: Key Contributions and Innovations
- The model provides stable reward signals during training, selects better samples at evaluation time, and can also serve as an evaluator on its own [4].
- A "consistency reward" mechanism ensures that the model's written analysis aligns with its final answer, encouraging logically coherent judgments (see the reward-shaping sketch after this summary) [11][31].
- The research team collected 200,000 preference data points to build the R1-Reward-200k training dataset and adopted a progressive-difficulty training strategy to improve learning [11][34].

Group 3: Algorithm Enhancements
- The StableReinforce algorithm addresses limitations of existing reinforcement learning methods by introducing Pre-Clip and an Advantage Filter to stabilize training and improve performance (a hedged sketch of both steps follows this summary) [9][26].
- Pre-Clip limits the impact of large probability ratios during the importance-ratio computation, while the Advantage Filter keeps only samples whose advantages fall within a specified range, preventing extreme values from destabilizing training [23][26].
- The model's average output length dropped by roughly 15% after reinforcement learning training, suggesting improved efficiency [44].

Group 4: Future Directions
- The article highlights room for further exploration of reinforcement learning in reward modeling, including more advanced voting strategies at inference time and improved training methods to strengthen the model's foundational capabilities [45].
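As a rough illustration of the Pre-Clip and Advantage Filter steps summarized above, the PyTorch-style sketch below bounds the log-probability ratio before exponentiation and masks out samples with extreme normalized advantages. It is a minimal sketch assuming a PPO-style clipped objective; all names and default values (pre_clip, adv_bound, clip_eps) are illustrative assumptions, not the authors' released implementation.

```python
import torch

def stable_policy_loss(new_logprobs, old_logprobs, advantages,
                       clip_eps=0.2, pre_clip=3.0, adv_bound=3.0):
    """Hypothetical sketch of a clipped policy-gradient loss with two
    extra stabilizers: a Pre-Clip on the log-ratio before exponentiation,
    and an Advantage Filter that drops samples with extreme advantages."""
    # Pre-Clip: bound the log-ratio so exp() cannot produce huge ratios.
    log_ratio = torch.clamp(new_logprobs - old_logprobs, -pre_clip, pre_clip)
    ratio = torch.exp(log_ratio)

    # Advantage Filter: keep only samples whose normalized advantage lies
    # within [-adv_bound, adv_bound]; extreme values are masked out.
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    mask = (adv.abs() <= adv_bound).float()

    # Standard clipped surrogate objective on the surviving samples.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_sample = -torch.min(unclipped, clipped) * mask
    return per_sample.sum() / mask.sum().clamp(min=1.0)
```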
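The article states only that the consistency reward keeps the model's analysis aligned with its final answer; how it combines with the correctness signal is not detailed here. The snippet below shows one plausible, invented gating: the consistency bonus is granted only when the answer is also correct, so a fluent analysis cannot rescue a wrong judgment.

```python
def shaped_reward(answer_correct: bool, analysis_consistent: bool,
                  w_result: float = 1.0, w_consistency: float = 0.5) -> float:
    """Invented combination of a correctness reward with a consistency
    bonus; the weights and the gating are illustrative assumptions, not
    the formula used in the R1-Reward paper."""
    r = w_result if answer_correct else 0.0
    if answer_correct and analysis_consistent:
        r += w_consistency  # consistency bonus, gated on correctness
    return r
```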
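The observation that accuracy keeps improving with more inference-time sampling usually cashes out as querying the reward model several times and aggregating its verdicts. Below is a minimal majority-vote sketch under that assumption; judge_once is a hypothetical placeholder for one stochastic (temperature > 0) call to the reward model, not an API from the R1-Reward release.

```python
from collections import Counter
from typing import Callable

def vote_preference(judge_once: Callable[[], str], k: int = 5) -> str:
    """Query the reward model k times on the same (query, response A,
    response B) triple and return the majority verdict ('A' or 'B').
    judge_once is a hypothetical callable wrapping one sampled judgment."""
    votes = Counter(judge_once() for _ in range(k))
    return votes.most_common(1)[0][0]

# Example: a stand-in judge that prefers 'A' about 70% of the time.
if __name__ == "__main__":
    import random
    print(vote_preference(lambda: "A" if random.random() < 0.7 else "B", k=15))
```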