Reinforcement Learning with Verifiable Rewards (RLVR)
Papers from these Meta star researchers: each one you read is one fewer to come
量子位 · 2025-11-17 04:52
Core Insights
- The article covers recent research led by Tian Yuandong and his team on the training dynamics of Reinforcement Learning with Verifiable Rewards (RLVR), revealing that despite significant performance improvements, only a small fraction of parameters is updated during training [2][4][5].

Group 1: Research Findings
- The study identifies a misconception about the sparse parameter updates seen in RL training: the sparsity is merely a surface phenomenon, and a deeper mechanism of model-conditioned optimization bias is at play [4][10].
- The team introduced a Three-Gate Theory to explain how RL updates are constrained, guided, and filtered, so that updates are channeled into specific parameter regions [6][11].
- The research highlights that RL training achieves high returns with small parameter changes, in contrast to the dense updates seen in supervised fine-tuning (SFT) [8][9].

Group 2: Experimental Results
- Analysis of various models, including the Qwen series and DeepSeek-R1, showed that RL training produced update sparsity ranging from 36% to 92%, while SFT exhibited sparsity between 0.6% and 18.8% (a measurement sketch follows this summary) [9][10].
- The experiments confirmed that RLVR and SFT optimize different regions of parameter space, with RL updates showing a strong tendency to avoid high-curvature directions, which are more sensitive to change [18][20].
- The study also demonstrated that restricting updates to non-principal components and low-amplitude weights matches the theoretical predictions and tracks dense RLVR trajectories more faithfully [27][28].

Group 3: Implications for Future Research
- The findings suggest that many parameter-efficient fine-tuning (PEFT) methods from the SFT era may not transfer well to RLVR, particularly those built on sparse or low-rank priors [25][26].
- The research indicates that the higher learning rates used in recent LoRA variants can lead to instability and premature collapse, because these methods force updates along the principal directions that RLVR avoids [29].
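The headline measurement in this work, the fraction of weights left untouched by fine-tuning, is straightforward to reproduce on any pair of checkpoints. Below is a minimal sketch using PyTorch and Transformers; the checkpoint names, the bf16 exact-equality criterion, and the single-process loading are illustrative assumptions, not the authors' exact measurement protocol.

```python
# Minimal sketch: measure update sparsity (fraction of weights left unchanged)
# between a base model and a fine-tuned checkpoint, e.g. RLVR- vs SFT-trained.
# Checkpoint names and the bf16 exact-equality criterion are illustrative
# assumptions, not the authors' exact protocol.
import torch
from transformers import AutoModelForCausalLM

def update_sparsity(base_name: str, tuned_name: str) -> float:
    base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name, torch_dtype=torch.bfloat16)
    tuned_params = dict(tuned.named_parameters())

    unchanged, total = 0, 0
    for name, p_base in base.named_parameters():
        p_tuned = tuned_params[name]
        # A weight counts as "not updated" if it is bit-identical in bf16;
        # torch.isclose with a small tolerance would be a softer criterion.
        unchanged += (p_base == p_tuned).sum().item()
        total += p_base.numel()
    return unchanged / total  # fraction of parameters left untouched

if __name__ == "__main__":
    # Hypothetical checkpoint pair; substitute any base/fine-tuned pair you have.
    frac = update_sparsity("Qwen/Qwen2.5-7B", "your-org/qwen2.5-7b-rlvr")
    print(f"update sparsity: {frac:.1%}")
```

Under the numbers quoted above, an RLVR checkpoint would return a value in the 36% to 92% range, while an SFT checkpoint of the same base would land closer to 0.6% to 18.8%.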
Upending large-model post-training: Danqi Chen's team proposes RLMT, Reinforcement Learning with Model-rewarded Thinking
36Kr · 2025-09-29 10:54
Core Insights
- The article discusses a breakthrough in enhancing the reasoning capabilities of large language models (LLMs) through a new framework, Reinforcement Learning with Model-rewarded Thinking (RLMT), which has models generate detailed reasoning chains before producing responses [2][6][25].
- RLMT combines the strengths of two existing paradigms, Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR), enabling better performance on open-ended tasks [6][8][25].
- Models trained with RLMT outperform existing models such as GPT-4o and Llama-3.1-8B-Instruct despite using significantly fewer training prompts [3][16][25].

Summary by Sections

RLMT Framework
- RLMT requires the LLM to produce a detailed reasoning trajectory before generating the final response, optimizing the entire process through online reinforcement learning [7][8].
- The framework keeps the RLVR practice of reasoning before answering while swapping in a preference-based reward model from RLHF, so the model learns to "think" on open-ended tasks (a training-step sketch follows this summary) [6][8].

Model Performance
- An 8-billion-parameter model trained with RLMT surpassed GPT-4o on chat and creative-writing tasks and reached performance comparable to Claude-3.7-Sonnet [3][16].
- The Llama-3.1-8B model trained with RLMT achieved an average score of 50.4 on WildBench, outperforming models with nearly ten times as many parameters [16][17].

Training Methodology
- RLMT delivered significant gains even in the "Zero" setting, where RL is applied directly to the base model without supervised fine-tuning: Llama-3.1-8B-RLMT-Zero scored 15.6, surpassing the Llama-3.1-8B-Instruct model trained on more than 25 million samples [18][25].
- The research emphasizes that the quality of the prompts, the strength of the reward model, and the reasoning process itself are critical to RLMT's success [20][25].

Implications for Future Research
- The findings suggest a paradigm shift in language-model training: strengthening a model's reasoning ability may matter more than relying solely on large datasets [25][26].
- Future work could explore optimizing reasoning formats and extending RLMT to other domains such as logical reasoning and multimodal models [25][26].
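To make the RLMT recipe concrete, the sketch below shows the shape of one training step: sample a reasoning trace plus a response, score only the response with a preference reward model, and push up high-reward samples with a policy-gradient update. The summary only says "online reinforcement learning", so the plain REINFORCE update with a group-mean baseline, the <think>...</think> delimiters, the dummy reward_fn, and the Llama-3.1-8B checkpoint name are all illustrative assumptions, not the authors' exact setup.

```python
# Schematic sketch of one RLMT-style update (illustrative, not the paper's code):
# the policy first writes a reasoning trace, then a final answer; only the answer
# is scored by a (placeholder) preference reward model, and a REINFORCE-style
# loss with a group-mean baseline nudges the policy toward high-reward samples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

policy_name = "meta-llama/Llama-3.1-8B"   # assumed base checkpoint
tok = AutoTokenizer.from_pretrained(policy_name)
tok.pad_token = tok.pad_token or tok.eos_token
policy = AutoModelForCausalLM.from_pretrained(policy_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def reward_fn(prompt: str, response: str) -> float:
    """Placeholder for an RLHF-style preference reward model."""
    return float(len(response.split()) > 5)   # dummy score, for illustration only

def sequence_logprob(seq_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Differentiable log-prob of the generated tokens under the current policy."""
    logits = policy(seq_ids).logits[:, :-1]              # position t predicts token t+1
    logp = torch.log_softmax(logits.float(), dim=-1)
    targets = seq_ids[:, 1:]
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logp[:, prompt_len - 1:].sum()          # generated part only

def rlmt_step(prompt: str, num_samples: int = 4) -> float:
    template = prompt + "\nWrite your reasoning inside <think>...</think>, then the final answer."
    inputs = tok(template, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    rewards, logps = [], []
    for _ in range(num_samples):
        with torch.no_grad():                            # sampling itself needs no grad
            seq = policy.generate(**inputs, max_new_tokens=256, do_sample=True)
        text = tok.decode(seq[0, prompt_len:], skip_special_tokens=True)
        answer = text.split("</think>")[-1]              # reward only the final answer
        rewards.append(reward_fn(prompt, answer))
        logps.append(sequence_logprob(seq, prompt_len))
    advantages = torch.tensor(rewards) - sum(rewards) / len(rewards)  # group-relative baseline
    loss = -(advantages * torch.stack(logps)).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The choice to score only the text after the closing reasoning tag mirrors the framework's split between the thinking trajectory and the final response; in practice the reward model would be a trained preference scorer rather than the length-based stub used here.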