Breaking the bottleneck of general-domain reasoning! RLPR, new reinforcement learning research from the Tsinghua NLP lab
机器之心 · 2025-06-27 00:49
Core Viewpoint
- The article introduces Reinforcement Learning with Reference Probability Reward (RLPR), a novel reinforcement learning technique from the Tsinghua NLP lab that addresses the difficulty existing methods have in generalizing to diverse domains beyond mathematics and coding [4][24].

Group 1: RLPR Technology Overview
- RLPR substantially improves the quality of probability-based rewards through its Prob-to-Reward method, outperforming likelihood-based baselines in both performance and training stability [7][24]. (A minimal sketch of the reward computation appears after Group 4.)
- The technique also introduces a dynamic filtering mechanism based on the standard deviation of rewards, further improving the stability and performance of reinforcement learning [8][17].

Group 2: Effectiveness of PR
- The research team found that the probability an LLM assigns to the reference answer directly reflects the quality of its own reasoning: the more accurate the reasoning, the higher the probability of generating the correct reference answer [11][24].
- The PR mechanism therefore captures the model's self-assessment of its reasoning quality, making it a reliable signal for evaluating outputs [11][13].

Group 3: Advantages Over Existing Methods
- Whereas existing RLVR methods require substantial human effort to build domain-specific verification rules, RLPR produces a reward score with a single forward pass, handling the complexity of natural language far more efficiently [13][24].
- RLPR's dynamic filtering mechanism retains for training only the samples with high reward standard deviation, enhancing training stability and effectiveness [17][24]. (A filtering sketch also follows Group 4.)

Group 4: Robustness and Validation
- Evaluating different reward sources with the ROC-AUC metric, the team showed that PR outperformed both rule-based rewards and verifier-model rewards at the 0.5B-parameter scale, with further gains possible as model capability grows [19][21]. (An AUC comparison sketch closes this summary.)
- RLPR delivered stable performance improvements across different training templates and base models, including Gemma and Llama, surpassing traditional rule-based RLVR baselines [22][24].
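To make Prob-to-Reward concrete, the sketch below shows one way a probability-based reward could be computed in PyTorch from a causal LM's logits at the reference-answer positions. This is illustrative, not the paper's exact implementation: the function name `prob_to_reward`, the toy tensors, and the choice to average per-token probabilities rather than log-probabilities are assumptions consistent with the article's description.

```python
import torch
import torch.nn.functional as F

def prob_to_reward(logits: torch.Tensor, reference_ids: torch.Tensor) -> torch.Tensor:
    """Mean per-token probability of the reference answer under the policy.

    logits:        (T, V) next-token logits at the positions where the
                   reference-answer tokens should be predicted.
    reference_ids: (T,) token ids of the reference answer.
    Returns a scalar reward in (0, 1).
    """
    log_probs = F.log_softmax(logits, dim=-1)                                   # (T, V)
    token_logp = log_probs.gather(-1, reference_ids.unsqueeze(-1)).squeeze(-1)  # (T,)
    # Averaging probabilities (not summing log-probs) keeps a single
    # low-probability token from zeroing the reward, which is plausibly why
    # the article contrasts PR with likelihood-based baselines.
    return token_logp.exp().mean()

# Toy usage: random logits for a 4-token reference answer over a 10-word vocab.
torch.manual_seed(0)
logits = torch.randn(4, 10)
reference = torch.tensor([1, 3, 3, 7])
print(prob_to_reward(logits, reference))  # a scalar in (0, 1)
```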
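The std-based filtering from Groups 1 and 3 can be sketched as a simple batch operation: prompts whose sampled responses all receive nearly the same reward carry little learning signal, so only high-variance groups are kept. The `keep_ratio` knob and the fixed-fraction selection here are hypothetical simplifications of the dynamic thresholding the article describes.

```python
import numpy as np

def filter_by_reward_std(group_rewards: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the prompts whose sampled-response rewards vary the most.

    group_rewards: (num_prompts, k) PR rewards for k sampled responses per prompt.
    Returns indices of the retained prompts.
    """
    stds = group_rewards.std(axis=1)
    # Low-std groups yield near-zero advantages and mostly add noise, so keep
    # the highest-variance fraction. keep_ratio is an illustrative knob; the
    # paper's mechanism adapts its threshold during training.
    num_keep = max(1, int(len(stds) * keep_ratio))
    return np.argsort(stds)[::-1][:num_keep]

rewards = np.array([[0.9, 0.9, 0.9, 0.9],   # already solved: little signal
                    [0.1, 0.8, 0.3, 0.9],   # informative: high variance
                    [0.2, 0.2, 0.1, 0.2]])  # consistently failing
print(filter_by_reward_std(rewards))  # -> [1]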
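Finally, the ROC-AUC comparison in Group 4 amounts to asking how well each reward source ranks correct responses above incorrect ones. The sketch below uses made-up labels and scores (not the paper's data) with scikit-learn's `roc_auc_score`.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical evaluation data: whether each response was actually correct,
# and the score each reward source assigned to it.
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
pr_scores   = [0.84, 0.22, 0.67, 0.91, 0.35, 0.18, 0.73, 0.40]  # continuous PR reward
rule_scores = [1, 0, 0, 1, 0, 0, 1, 1]                          # brittle exact-match verifier

# Higher AUC means the reward ranks correct responses above incorrect ones
# more reliably; the article reports PR beating rule-based and verifier-model
# rewards on this metric at the 0.5B scale.
print("PR reward AUC:  ", roc_auc_score(labels, pr_scores))
print("Rule reward AUC:", roc_auc_score(labels, rule_scores))
```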