Core Viewpoint
- The article discusses the development and application of the R1-Reward model, which uses a new algorithm called StableReinforce to enhance the performance of multimodal reward models (MRMs) through reinforcement learning (RL), addressing issues of training instability and inconsistency in reward modeling [1][38].

Group 1: R1-Reward Model and Its Applications
- R1-Reward has shown significant academic value and has been successfully applied in practical scenarios at Kuaishou, such as short video, e-commerce, and live streaming, achieving notable performance improvements [2].
- The R1-Reward model outperforms state-of-the-art (SOTA) models by 5%-15% on existing multimodal reward model benchmarks, with further improvements observed as the number of inference samples increases [1][38].

Group 2: Algorithm Improvements
- The article introduces a new algorithm, StableReinforce, which optimizes existing RL methods to improve training stability and efficiency [9].
- Key improvements include a progressive training strategy, a robust advantage-value handling method called the Advantage Filter, and a novel "consistency reward" mechanism that checks the coherence between the model's analysis and its final answer (sketches of both follow this summary) [12][25].

Group 3: Training Methodology
- The training process involves two steps: a supervised fine-tuning (SFT) phase using a dataset of 200,000 preference data points, followed by a reinforcement learning phase focusing on more challenging samples [27][30].
- The SFT phase lets the model learn the task format and reasoning process, while the RL phase targets samples deemed "harder" to improve the model's ability to discern subtle differences (a hard-sample selection sketch follows this summary) [32].

Group 4: Experimental Results
- R1-Reward has demonstrated exceptional performance on multiple multimodal reward model leaderboards, significantly surpassing previous best models [34].
- A voting strategy during evaluation, where the model outputs multiple judgments and the most frequent answer is selected, has led to substantial accuracy improvements, with accuracy rising from approximately 71% to 86.47% when voting 15 times (a voting sketch follows this summary) [35].

Group 5: Future Directions
- The article suggests there are many unexplored avenues for applying RL to reward modeling, including more advanced voting strategies and improved training methodologies to further strengthen the model's foundational capabilities [38].
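The Advantage Filter mentioned in Group 2 is described only at a high level here. Below is a minimal sketch of one way such a filter could work, assuming advantages are standardized within a batch and samples beyond a fixed threshold are masked out; the function name, the threshold of 3.0, and the use of PyTorch are illustrative assumptions, not details from the article.

```python
import torch

def filter_advantages(advantages: torch.Tensor, threshold: float = 3.0) -> torch.Tensor:
    """Sketch of an Advantage Filter-style step.

    Advantages are standardized within the batch, and samples whose
    standardized value falls outside [-threshold, threshold] are zeroed
    out so they contribute no gradient. The threshold is an illustrative
    choice, not a value from the article.
    """
    mean = advantages.mean()
    std = advantages.std().clamp_min(1e-8)
    standardized = (advantages - mean) / std
    mask = standardized.abs() <= threshold
    # Zero the advantage (and hence the policy-gradient term) for outliers.
    return advantages * mask
```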
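The "consistency reward" checks whether the model's written analysis actually supports its final answer. A minimal sketch of one way to compose such a reward is shown below, gating correctness on a consistency check; `is_consistent` is a hypothetical stand-in for a judge model, and the multiplicative composition is an assumption rather than the exact formula used by R1-Reward.

```python
def compute_reward(analysis: str, answer: str, ground_truth: str, is_consistent) -> float:
    """Sketch of a composite reward: correctness gated by consistency.

    `is_consistent(analysis, answer)` stands in for a judge model that checks
    whether the written analysis actually supports the final answer; its name
    and signature are illustrative, not from the article.
    """
    result_reward = 1.0 if answer.strip() == ground_truth.strip() else 0.0
    consistency = 1.0 if is_consistent(analysis, answer) else 0.0
    # Reward correct answers only when the reasoning is coherent with them,
    # discouraging "right answer, unrelated analysis" behaviour.
    return result_reward * consistency
```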
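The RL phase in Group 3 focuses on "more challenging samples." One common way to identify such samples is to roll out the SFT-stage model several times per example and keep the examples it gets wrong or answers inconsistently. The sketch below assumes a hypothetical `sample_judgment` function and a dataset of dicts with a `label` field; both are illustrative, not taken from the article.

```python
from collections import Counter

def select_hard_samples(dataset, sample_judgment, n_rollouts: int = 8):
    """Keep preference examples the SFT-stage model finds hard.

    `sample_judgment(example)` is a hypothetical stand-in for sampling one
    judgment (e.g. "A" or "B") from the SFT-stage model. An example is
    treated as "hard" if the model is wrong or disagrees with itself
    across rollouts.
    """
    hard = []
    for example in dataset:
        votes = Counter(sample_judgment(example) for _ in range(n_rollouts))
        majority_answer, majority_count = votes.most_common(1)[0]
        wrong = majority_answer != example["label"]
        uncertain = majority_count < n_rollouts  # not unanimous
        if wrong or uncertain:
            hard.append(example)
    return hard
```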
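The evaluation-time voting strategy in Group 4 samples several judgments per example and keeps the most frequent one. A minimal sketch follows, reusing the same majority pattern as above; `sample_judgment` is again a hypothetical wrapper around one sampled generation from the reward model, and k = 15 matches the vote count reported in the summary.

```python
from collections import Counter

def majority_vote(sample_judgment, example, k: int = 15) -> str:
    """Sample k judgments for one example and return the most frequent answer."""
    votes = Counter(sample_judgment(example) for _ in range(k))
    return votes.most_common(1)[0][0]
```

With an odd k and binary preference judgments, an exact tie between the two answers cannot occur, so the most frequent answer is always well defined.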
RL training keeps collapsing? R1-Reward stably unlocks Long-CoT reasoning for reward models
机器之心·2025-05-12 04:31