Reinforcement Learning with Verifiable Rewards (RLVR)
These Top Researchers' Papers at Meta: Read One, and There's One Fewer Left
36Kr · 2025-11-17 09:52
Core Insights
- The article discusses a perplexing phenomenon in large model reinforcement learning (RL) training, where significant performance improvements occur despite minimal parameter changes [1][3]

Group 1: Research Findings
- The paper analyzes the training dynamics of Reinforcement Learning with Verifiable Rewards (RLVR), debunking the misconception that sparse parameter updates are merely superficial; instead, it reveals a consistent optimization bias inherent in RLVR [3][5]
- The research introduces a new framework called the Three-Gate Theory, which explains how RLVR parameter updates are steered towards specific parameter regions [5][7]

Group 2: Parameter Update Characteristics
- The study highlights a paradox where RL training yields high performance gains with sparse parameter updates, in contrast with the dense updates seen in supervised fine-tuning (SFT) [5][6]
- The sparsity of updates in RL training ranges from 36% to 92%, while SFT shows sparsity between 0.6% and 18.8%, indicating a significant difference in update density [5][6]

Group 3: Three-Gate Theory Components
- The first gate, KL Anchoring, ensures that RL updates do not deviate significantly from the model's original output distribution, keeping parameter drift small [8]
- The second gate, Model Geometry, indicates that RL updates prefer low-curvature directions in the optimization landscape, preserving the model's original weight structure [9]
- The third gate, Precision, explains that the limited precision of bfloat16 can mask small RL updates, producing the appearance of sparsity (illustrated in the sketch after this summary) [11]

Group 4: Implications for Parameter-Efficient Fine-Tuning
- The findings suggest that many parameter-efficient fine-tuning (PEFT) methods from the SFT era do not transfer well to RLVR, particularly those built on sparse or low-rank priors [17]
- The study indicates that updating non-principal, low-amplitude weights aligns better with RLVR's optimization trajectory, while methods like PiSSA may not provide additional benefits and can lead to instability [17]
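To make the Precision gate concrete, here is a minimal PyTorch sketch, not the paper's measurement code: the tensor size and update scale are illustrative assumptions. It shows how bfloat16 storage can round away small RL-sized updates, so a checkpoint diff looks sparse even though the optimizer moved essentially every weight.

```python
import torch

def update_sparsity(before: torch.Tensor, after: torch.Tensor) -> float:
    """Fraction of stored weights whose value did not change at all between checkpoints."""
    return (before == after).float().mean().item()

# Toy illustration of the "Precision" gate: a small RL-sized update applied in
# float32 changes (almost) every weight, but storing the same weights in
# bfloat16 rounds many of those tiny changes away, so the diff looks sparse.
torch.manual_seed(0)
w = torch.randn(10_000) * 0.1          # stand-in for one weight matrix
delta = torch.randn(10_000) * 1e-4     # small, RL-scale update

fp32_sparsity = update_sparsity(w, w + delta)
bf16_sparsity = update_sparsity(w.to(torch.bfloat16), (w + delta).to(torch.bfloat16))

print(f"update sparsity when stored in float32 : {fp32_sparsity:.3f}")   # ~0.00
print(f"update sparsity when stored in bfloat16: {bf16_sparsity:.3f}")   # well above 0
```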
Boosting Reasoning Performance Without Changing the Model? ICLR Submission Proposes OTV, a New Test-Time Scaling Paradigm
量子位· 2025-10-23 00:08
Core Insights
- The article discusses the challenges faced by large language models, including hallucinations, logical errors, and reasoning flaws, prompting researchers to explore new methods to enhance output reliability [1]
- A novel approach called One-Token Verification (OTV) is introduced, which allows models to monitor their reasoning process in real time without altering the original model's structure or parameters [2]

Summary by Sections

Current Mainstream Paradigms
- LoRA fine-tuning is highlighted as a popular parameter-efficient tuning method that avoids full-parameter training and is easy to deploy, but it often relies on detailed supervised data and can lead to "forgetting effects" [3]
- Quality screening of generated results can enhance output credibility but tends to be reactive, making it difficult to correct the model's reasoning in real time and offering no insight into the internal reasoning process [4]

Parallel Thinking Framework
- The article introduces the concept of Parallel Thinking, which lets language models generate multiple reasoning paths simultaneously and then filter them through a specific mechanism [5]
- OTV builds on this framework by focusing on efficiently selecting correct reasoning paths at a lower cost, rather than on generating more paths [5]

OTV Mechanism
- OTV employs an internal verifier that analyzes the reasoning process using a lightweight role vector implemented via LoRA, running in parallel with the original model [9]
- The internal verifier utilizes the key-value cache (KV Cache) of the Transformer architecture to capture rich information about the model's internal dynamics during reasoning [9]
- A special token, referred to as the "Token of Truth" (ToT), is inserted during the verification phase to assess the correctness of the reasoning path [9]

Training and Efficiency
- OTV's internal verifier is designed to be lightweight, with a training scheme that assigns heuristic pseudo-labels based on the correctness of the final answer [10]
- The training process is highly parallelized, allowing simultaneous scoring predictions for all positions, making it computationally comparable to conventional LoRA fine-tuning [10]

Experimental Validation
- OTV was systematically evaluated on various open-source models, demonstrating superior accuracy and a preference for shorter, more accurate reasoning paths compared to baseline methods [14]
- The results indicate that OTV can read both the internal reasoning state and the output quality, significantly outperforming general methods that rely solely on output text [15]

Dynamic Control of Computational Costs
- OTV enables models to dynamically control computational expense by eliminating low-quality paths in real time based on confidence scores, reducing computational load by nearly 90% while maintaining optimal accuracy (see the sketch after this summary) [17]

Future Prospects
- The OTV framework opens avenues for deeper integration with the original model and for a three-state scheme that adds an "uncertain" state, enhancing selective prediction capabilities [25][26]
- The approach could also be extended to different model architectures, optimizing KV cache structures to further improve reasoning efficiency and representation utilization [26]
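As a rough sketch of the selection step described above, a lightweight verifier head can score each parallel reasoning path from the hidden state at the inserted verification token and prune low-confidence paths before further decoding. This is not the authors' implementation: the hidden size, the random stand-in states, and the 0.5 cut-off are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyVerifierHead(nn.Module):
    """Lightweight scorer reading the hidden state at a special verification token."""
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, tot_hidden: torch.Tensor) -> torch.Tensor:
        # tot_hidden: (num_paths, hidden_size), one row per parallel reasoning path
        return torch.sigmoid(self.score(tot_hidden)).squeeze(-1)  # (num_paths,)

torch.manual_seed(0)
num_paths, hidden_size = 8, 1024
tot_hidden = torch.randn(num_paths, hidden_size)  # stand-in for states a real run would read from the KV cache

verifier = TinyVerifierHead(hidden_size)
confidence = verifier(tot_hidden)                 # one correctness score per path

keep_threshold = 0.5                              # illustrative cut-off
kept = torch.nonzero(confidence >= keep_threshold).flatten()
print("path confidences:", [round(c, 2) for c in confidence.detach().tolist()])
print("paths kept for further decoding:", kept.tolist())
```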
OpenAI's Roadmap Questioned; Meta Researcher: Superintelligence Simply Cannot Be Built
36Kr · 2025-06-20 12:00
Core Insights
- The pursuit of "superintelligence" represents a significant ambition among leading AI companies like Meta, OpenAI, and Google DeepMind, with substantial investments being made in this direction [1][3][4]
- Sam Altman of OpenAI suggests that building superintelligence is primarily an engineering challenge, indicating a belief in a feasible path to achieve it [3][4]
- Meta AI researcher Jack Morris argues that the current approach of using large language models (LLMs) and reinforcement learning (RL) may not be sufficient to construct superintelligence [1][2]

Group 1: Current Approaches and Challenges
- Morris outlines three potential methods for building superintelligence: purely supervised learning (SL), RL from human validators, and RL from automated validators (see the toy comparison after this summary) [2]
- The integration of non-text data into models is believed not to enhance overall performance, as human-written text carries intrinsic value that sensory inputs do not [2][6]
- The concept of a "data wall" or "token crisis" is emerging, where the availability of text data for training LLMs is becoming a concern, leading to extensive efforts to scrape and transcribe data from various sources [8][19]

Group 2: Learning Algorithms and Their Implications
- The two primary learning methods identified for potential superintelligence are SL and RL, with SL being more stable and efficient for initial training [10][22]
- The hypothesis that superintelligence could emerge from SL alone is challenged by the limitations of current models, which may not exhibit human-level general intelligence despite excelling in specific tasks [15][16]
- The combination of SL and RL is proposed as a more viable path, leveraging human feedback or automated systems to refine model outputs [20][22][28]

Group 3: Future Directions and Speculations
- The potential for RL to effectively transfer learning across various tasks remains uncertain, raising questions about the scalability of this approach to achieve superintelligence [34]
- The competitive landscape among AI companies is likely to intensify as they seek to develop the most effective training environments for LLMs, potentially leading to breakthroughs in superintelligence [34]
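The distinction between the two RL signals above is easiest to see in code. The sketch below is purely illustrative and is not Morris's formulation; the boxed-answer convention is an assumption. An automated validator yields a verifiable, programmatic reward, whereas RL from human validators would replace this check with collected or learned preference scores.

```python
import re

def automated_validator_reward(model_output: str, ground_truth: str) -> float:
    """Verifiable reward: 1.0 iff the final boxed answer matches the ground truth."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

# A human-validator setup has no programmatic check to call; the reward would
# instead come from preference labels or a reward model trained on them.
print(automated_validator_reward(r"... therefore the result is \boxed{42}", "42"))  # 1.0
print(automated_validator_reward(r"... therefore the result is \boxed{41}", "42"))  # 0.0
```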
LLM + RL Under Question: Even Deliberately Wrong Rewards Yield Significant Gains on Math Benchmarks, and the AI Community Erupts
机器之心· 2025-05-28 08:09
Core Insights
- The article discusses a recent paper that challenges conventional assumptions about reinforcement learning (RL) for training large language models (LLMs), showing that even false rewards can improve performance [3][4][5]

Group 1: Findings on Reinforcement Learning
- The study reveals that false rewards, including random and incorrect rewards, can significantly improve the performance of the Qwen2.5-Math-7B model on the MATH-500 benchmark: random rewards improve scores by 21% and incorrect rewards by 25%, compared with a 28.8% improvement from true rewards (the reward variants are sketched after this summary) [5][10]
- The research questions the traditional belief that high-quality supervision signals are essential for effective RL training, suggesting that even minimal or misleading signals can yield substantial improvements [7][19]

Group 2: Model-Specific Observations
- The effectiveness of RL with false rewards appears to be model-dependent; other models such as Llama3 and OLMo2 did not show similar performance gains when trained with false rewards [16][17]
- The Qwen model demonstrated a distinctive tendency to leverage code generation for mathematical reasoning, with a code-generation frequency of 65% before RL training that rose to over 90% afterward [28][34]

Group 3: Implications for Future Research
- The findings indicate that future RL research should verify the applicability of these methods across diverse model families rather than relying on a single model's performance [25][49]
- Understanding the reasoning patterns learned during pre-training is crucial for designing effective RL training strategies, as these patterns significantly influence downstream performance [50]
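For concreteness, the reward variants compared in the study can be sketched as follows. This is not the authors' code; the is_correct checker is a simplified placeholder. The true reward depends on answer correctness, the random reward ignores the answer entirely, and the incorrect reward pays out only for wrong answers.

```python
import random

def is_correct(answer: str, ground_truth: str) -> bool:
    # Simplified placeholder; a real pipeline would normalize and compare math answers.
    return answer.strip() == ground_truth.strip()

def true_reward(answer: str, ground_truth: str) -> float:
    return 1.0 if is_correct(answer, ground_truth) else 0.0

def random_reward(answer: str, ground_truth: str, p: float = 0.5) -> float:
    return 1.0 if random.random() < p else 0.0                    # ignores the answer

def incorrect_reward(answer: str, ground_truth: str) -> float:
    return 1.0 if not is_correct(answer, ground_truth) else 0.0   # rewards only wrong answers

print(true_reward("42", "42"), incorrect_reward("42", "42"))  # 1.0 0.0
```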