Reinforcement Learning from Verifiable Rewards (RLVR)
Large Language Models: The Year 2025
Zhong Guo Jing Ying Bao· 2025-12-30 09:40
Core Insights
- The large language model (LLM) industry has seen significant development by 2025, with companies like DeepSeek emerging as strong competitors through open-source strategies and advanced reasoning capabilities [1]
- Major players such as OpenAI, Google, Tencent, Alibaba, and ByteDance continue to compete on technology, applications, and ecosystems, leveraging their advantages in user acquisition and problem-solving [1]

Group 1: Company Developments
- DeepSeek has made notable advances with its DeepSeek-V3 model, which has 671 billion total parameters and excels at mathematical reasoning and code generation, competing with closed-source models such as GPT-4o [2]
- DeepSeek-V3.2 aims to balance reasoning capability against output length, while DeepSeek-V3.2-Speciale pushes the limits of reasoning ability [3]
- ByteDance's Doubao model has reached daily token usage exceeding 50 trillion, making it the leading AI model in China and the third-ranked globally [3]

Group 2: Technological Innovations
- Tencent's Hunyuan model has progressed from technical breakthroughs to comprehensive ecosystem applications, showing a clear path from technology to practical deployment [2]
- The Qwen2.5-VL-32B-Instruct model uses a unified Transformer architecture, improving cross-modal generation accuracy by over 40% [4]
- Zhipu AI has doubled its parameter scale from 5 trillion to 10 trillion, achieving a reasoning accuracy of 98.5%, approaching international standards [4]

Group 3: Future Trends
- The future of LLMs is characterized as becoming "smarter, more vertical, and closer to daily life," transitioning from technical breakthroughs to deep applications across fields [7]
- The rise of on-device intelligent agents, such as Anthropic's Claude Code, allows low-latency interaction by deploying directly on user devices [8]
- The industry is expected to see significant advances in embodied-intelligence applications, which combine physical AI with large models, aligning with national development goals [9]
Large Models in 2025: 6 Key Insights
36Ke· 2025-12-23 11:39
On December 21, Beijing time, Andrej Karpathy, one of OpenAI's founding members and a leading figure in AI, published an annual deep-dive report titled "2025 LLM Year in Review."

In the review, Karpathy dissects in detail the underlying paradigm shift that the large language model (LLM) field went through over the past year. He argues that 2025 marks a decisive leap in AI training philosophy, from pure "probabilistic imitation" to "logical reasoning." The core driver of this shift is the maturation of reinforcement learning from verifiable rewards (RLVR), which uses objective feedback environments such as mathematics and code to push models to spontaneously generate "reasoning traces" resembling human thought. Karpathy believes this long-horizon reinforcement learning has already begun to eat into the share once held by pretraining, becoming the new engine for improving model capability.

Beyond the change in technical path, Karpathy also offers deep insight into the nature of intelligence. He uses "Summoning Ghosts" rather than "Evolving/growing Animals" as the metaphor for how current AI grows, explaining why today's large language models show a "jagged" performance profile: genius-level in cutting-edge domains, yet potentially as fragile as a child on basic common sense.

In addition, Karpathy also discusses the rise of "Vibe Coding" ...
Large Models in 2025: 6 Key Insights
腾讯研究院· 2025-12-23 08:33
Core Insights
- The article discusses a significant paradigm shift in the field of large language models (LLMs) in 2025, moving from "probabilistic imitation" to "logical reasoning," driven by the maturity of reinforcement learning from verifiable rewards (RLVR) [2][3]
- The author emphasizes that less than 10% of the potential of LLMs has been explored, indicating vast room for future development [3][25]

Group 1: Technological Advancements
- In 2025, RLVR emerged as the core new stage in training LLMs, allowing models to autonomously generate reasoning traces by training in environments with verifiable rewards [7][8]
- The increase in model capabilities in 2025 was primarily due to exploring and releasing the "stock potential" of RLVR, rather than significant changes in model parameter sizes [8][9]
- The introduction of the o1 model at the end of 2024 and the o3 model in early 2025 marked a qualitative leap in LLM capabilities [9]

Group 2: Nature of Intelligence
- The author argues that LLMs should be viewed as "summoned ghosts" rather than "evolving animals," highlighting a fundamental difference between their intelligence and that of biological entities [10][11]
- The performance of LLMs exhibits a "jagged" profile, excelling in advanced fields while struggling with basic common knowledge [12][13]

Group 3: New Applications and Interfaces
- The emergence of Cursor represents a new application layer for LLMs, focusing on context engineering and optimizing prompt design for specific verticals [15]
- The introduction of Claude Code (CC) demonstrated the core capabilities of LLM agents, operating locally on user devices and accessing private data [17][18]
- The concept of "vibe coding" allows users to create powerful programs using natural language, democratizing programming skills [20][21]

Group 4: Future Directions
- The article suggests that the future of LLMs will involve a shift toward visual and interactive interfaces, moving beyond text-only interaction [24]
- The potential for innovation in the LLM space remains vast, with many ideas yet to be explored, indicating continuous evolution in the industry [25]
Large Models in 2025: 6 Key Insights, from OpenAI Co-founder and AI Luminary "AK"
36Ke· 2025-12-22 04:22
他用"召唤幽灵"(Summoning Ghosts)而非"进化动物"(Evolving/growing Animals)来比喻当前AI的成长模式,解释了为何当前的大语言模 型会展现出"锯齿状"的性能特征——在尖端领域表现如天才,却在基础常识上可能如孩童般脆弱。 此外,卡帕西也对"氛围编程(Vibe Coding)"的兴起、本地化智能体的实用化趋势,以及大语言模型图形界面(LLM GUI)的演进进行 了详实的论述。他强调,虽然行业进步迅猛,但人类目前对这一新计算范式潜力的挖掘尚不足10%,未来的发展空间依旧极其广阔。 卡帕西揭示了一个冷酷却又充满希望的现实:我们正处于从"模拟人类智能"向"纯粹机器智能"跨越的临界点。随着RLVR等技术的普 及,2026年的AI竞争将不再局限于算力的军备竞赛,而是转向对"如何让AI高效思考"这一核心逻辑范式的深度挖掘。 以下为卡帕西年度回顾全文: 北京时间12月21日,OpenAI创始人之一、AI大神安德烈·卡帕西(Andrej Karpathy)发布了名为《2025年大语言模型年度回顾》(2025 LLM Year in Review)的年度深度观察报告。 在这份综述中,卡帕西 ...
Karpathy's Year-End LLM List, Viewed by Nearly Two Million People: These Are the Protagonists
机器之心· 2025-12-21 03:01
Editor | Du Wei

There are only 10 days left in 2025, which means it is time for a round of year-end retrospectives. For the field of artificial intelligence, 2025 was a year of rapid evolution for large language models (LLMs) and a dense stream of headline events.

Just yesterday, the well-known AI researcher Karpathy posted a list recording the "paradigm shifts" he personally considers the most important, and somewhat unexpected. Which areas do these changes, which truly reshaped the industry landscape and impressed Karpathy at a conceptual level, fall into? Let us go through them one by one (in the first person).

Reinforcement Learning from Verifiable Rewards (RLVR)

At the start of 2025, the production training pipeline for LLMs at nearly every lab looked like this:

- Pretraining (similar to GPT-2/3, circa 2020);
- Supervised fine-tuning (SFT, similar to InstructGPT, 2022);
- Reinforcement learning from human feedback (RLHF, circa 2022)

This pipeline was stable and reliable, and had long been regarded as the standard recipe for "industrial-grade LLMs." But in 2025, a new stage surfaced and quickly became the de facto default: Reinforcement Learning from Verifiable Rewards (RLVR). The core idea of RLVR is to have models undergo reinforcement learning training in environments that can be verified automatically ...
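To make "training in an automatically verifiable environment" concrete, here is a minimal sketch of a verifiable reward for math problems, assuming purely for illustration that the model marks its final answer with a \boxed{} span. This is not any particular lab's pipeline, just the shape of the check that replaces a human judge.

```python
import re

def extract_boxed_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} span out of a completion (illustrative convention)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_math_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0.

    No human labeler is involved; the environment itself verifies the output,
    which is what makes long-horizon RL on such tasks cheap to scale.
    """
    answer = extract_boxed_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

if __name__ == "__main__":
    sample = "Adding the two primes gives 5 + 7 = 12, so the answer is \\boxed{12}."
    print(verifiable_math_reward(sample, "12"))  # 1.0
    print(verifiable_math_reward(sample, "13"))  # 0.0
```

The same pattern applies to code tasks, where the reward would come from running unit tests instead of matching an answer string.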
Karpathy's 2025 Large-Model Recap Takes Silicon Valley by Storm
量子位· 2025-12-20 04:20
Luyu from Aofeisi
QbitAI | WeChat official account QbitAI

What AI trends did 2025 bring? Karpathy's year-end recap is taking Silicon Valley by storm. Six big theses, hardcore and thought-provoking: new paradigms, new applications, new models... Looking back, the changes large models brought over the past year are exhilarating. Yet Karpathy makes a bold prediction: only about 10% of the potential of large models has been tapped. Everything has only just begun...

2025 LLM Year in Review

Why does Karpathy believe only 10% of large models' potential has been tapped? On one hand they display strong reasoning ability; on the other they expose latent gaps in understanding, which is at once exciting and sobering. Specifically:

- RLVR (reinforcement learning from verifiable rewards) becomes a new training stage
- Large models should not be analogized to animal intelligence
- Cursor shows the next level of large-model applications
- Claude Code accelerates the adoption of on-device agents
- Vibe Coding will reshape the software industry
- Nano Banana reshapes human-computer interaction

RLVR becomes a new training stage

Before the start of this year, large models worldwide basically followed the same established training paradigm. In 2025, RLVR began to be added to it. By undergoing reinforcement learning in reward environments that can be verified automatically, models spontaneously form reasoning strategies, such as decomposing a problem into intermediate computations or looping over computations; see DeepSeek R1 for a concrete case. These strategies are very hard to obtain under the old paradigm, because large models ...
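DeepSeek R1's actual recipe is not reproduced here; the sketch below only illustrates, under stated assumptions, how binary verifiable rewards for a group of sampled completions can be turned into a learning signal in a simplified GRPO-style step. The function names and the group-normalization policy are illustrative choices, not the published implementation.

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize a group of verifiable rewards into advantages (simplified GRPO-style).

    Each completion's advantage is its reward minus the group mean, divided by the
    group standard deviation, so the policy is pushed toward completions that beat
    their siblings on the automatic check.
    """
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

def score_group(completions: list[str], ground_truth: str, verify) -> list[float]:
    """Apply an automatic verifier (answer check, unit tests, ...) to each completion."""
    return [verify(c, ground_truth) for c in completions]

if __name__ == "__main__":
    # Toy example: 4 sampled completions for one prompt, 2 pass the automatic check.
    rewards = [1.0, 0.0, 1.0, 0.0]
    print(group_advantages(rewards))  # positive for passing samples, negative otherwise
```

A policy-gradient update would then weight each completion's token log-probabilities by its advantage, with no reward model or human preference data in the loop.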
These AI Luminaries' Papers at Meta: Each New One Could Be Among the Last
36Ke· 2025-11-17 09:52
Core Insights
- The article discusses a perplexing phenomenon in reinforcement learning (RL) training of large models, where significant performance improvements occur despite minimal parameter changes [1][3]

Group 1: Research Findings
- The paper analyzes the training dynamics of reinforcement learning from verifiable rewards (RLVR), debunking the misconception that sparse parameter updates are merely superficial; instead, it reveals a fixed optimization bias inherent in RLVR [3][5]
- The research introduces a new framework, the Three-Gate Theory, which explains how RLVR parameter updates are steered toward specific parameter regions [5][7]

Group 2: Parameter Update Characteristics
- The study highlights a paradox in which RL training yields large performance gains from sparse parameter updates, in contrast with the dense updates seen in supervised fine-tuning (SFT) [5][6]
- Update sparsity in RL training ranges from 36% to 92%, while SFT shows sparsity between 0.6% and 18.8%, indicating a significant difference in update density [5][6]

Group 3: Three-Gate Theory Components
- The first gate, KL Anchoring, ensures that RL updates do not deviate significantly from the model's original output style, keeping the drift in parameter space small [8]
- The second gate, Model Geometry, indicates that RL updates prefer low-curvature directions in the optimization landscape, preserving the model's original weight structure [9]
- The third gate, Precision, explains that the limited precision of bfloat16 can mask small updates in RL, producing the appearance of sparsity [11]

Group 4: Implications for Parameter-Efficient Fine-Tuning
- The findings suggest that many parameter-efficient fine-tuning (PEFT) methods from the SFT era do not transfer well to RLVR, particularly those aligned with sparse or low-rank priors [17]
- The study indicates that updating non-principal, low-amplitude weights aligns better with RLVR's optimization trajectory, while methods like PiSSA may not provide additional benefits and can lead to instability [17]
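The precision gate is easy to check directly: bfloat16 keeps only 7 mantissa bits, so an optimizer step far smaller than a weight's rounding interval vanishes when the result is stored back in bfloat16, making the overall update look sparse even though gradients touched every weight. A minimal sketch assuming PyTorch is available (this is not the paper's code):

```python
import torch

# A bfloat16 weight at 1.0 has an upward rounding step of 2**-7 (~0.0078),
# because bfloat16 carries only 7 mantissa bits.
w = torch.tensor(1.0, dtype=torch.bfloat16)
tiny_update = 1e-4   # far below the rounding step
big_update = 1e-2    # comfortably above the rounding step

print((w + tiny_update) == w)  # tensor(True): the small update is rounded away
print((w + big_update) == w)   # tensor(False): the large update survives

# The same effect measured as "update sparsity" over a whole weight matrix:
torch.manual_seed(0)
weights = torch.randn(1000, 1000).to(torch.bfloat16)
updates = 1e-4 * torch.randn(1000, 1000)            # dense float32 updates
stored = (weights.float() + updates).to(torch.bfloat16)
unchanged = (stored == weights).float().mean().item()
print(f"fraction of weights left bit-identical: {unchanged:.2%}")
```

Most of the dense float32 updates leave the stored bfloat16 weights bit-identical, which is the "apparent sparsity" the third gate describes.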
Boosting Reasoning Performance Without Changing the Model? ICLR Submission Proposes OTV, a New Test-Time Scaling Paradigm
量子位· 2025-10-23 00:08
Core Insights
- The article discusses the challenges faced by large language models, including hallucinations, logical errors, and reasoning flaws, prompting researchers to explore new methods to enhance output reliability [1]
- A novel approach called One-Token Verification (OTV) is introduced, which allows models to monitor their reasoning process in real time without altering the original model structure or parameters [2]

Summary by Sections

Current Mainstream Paradigms
- LoRA fine-tuning is highlighted as a popular parameter-efficient tuning method that avoids full-parameter training and is easy to deploy, but it often relies on detailed supervised data and can lead to "forgetting effects" [3]
- Quality screening of generated results can enhance output credibility but tends to be reactive, making it difficult to correct the model's reasoning in real time and offering no insight into the internal reasoning process [4]

Parallel Thinking Framework
- The article introduces the concept of parallel thinking, in which a language model generates multiple reasoning paths simultaneously and then filters them through a specific mechanism [5]
- OTV builds on this framework by focusing on efficiently selecting correct reasoning paths at low cost rather than generating ever more paths [5]

OTV Mechanism
- OTV employs an internal verifier that analyzes the reasoning process using a lightweight role vector implemented via LoRA, running in parallel with the original model [9]
- The internal verifier uses the key-value cache (KV cache) of the Transformer architecture to capture rich information about the model's internal dynamics during reasoning [9]
- A special token, referred to as the "Token of Truth" (ToT), is inserted during the verification phase to assess the correctness of the reasoning path [9]

Training and Efficiency
- OTV's internal verifier is designed to be lightweight, with a training scheme that assigns heuristic pseudo-labels based on the correctness of the final answer [10]
- The training process is highly parallelized, allowing scoring predictions for all positions at once, making its compute cost comparable to conventional LoRA fine-tuning [10]

Experimental Validation
- OTV was systematically evaluated on various open-source models, demonstrating superior accuracy and a preference for shorter, more accurate reasoning paths compared to baseline methods [14]
- The results indicate that OTV can read the internal reasoning state and output quality, significantly outperforming general methods that rely solely on output text [15]

Dynamic Control of Computational Costs
- OTV enables models to dynamically control computational cost by pruning low-quality paths in real time based on confidence scores, reducing the computational load by nearly 90% while maintaining optimal accuracy [17]

Future Prospects
- The OTV framework opens avenues for deeper integration with the original model and for a three-state scheme that includes an "uncertain" state, enhancing selective prediction [25][26]
- The approach could also be extended to different model architectures, optimizing KV-cache structures to further improve reasoning efficiency and representation utilization [26]
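The OTV verifier itself is not reproduced here; the sketch below only illustrates the pruning step described above, assuming each in-flight reasoning path already carries a scalar confidence from some verifier. The `Path` container, the threshold, and the keep-at-least-one policy are hypothetical choices for illustration.

```python
from dataclasses import dataclass

@dataclass
class Path:
    text: str          # the partial reasoning trace generated so far
    confidence: float  # verifier's running estimate that this path is correct

def prune_paths(paths: list[Path], threshold: float = 0.3, keep_at_least: int = 1) -> list[Path]:
    """Drop low-confidence reasoning paths mid-generation.

    Paths below the confidence threshold stop consuming compute; at least one
    path is always kept so generation can finish even if all scores are low.
    """
    survivors = [p for p in paths if p.confidence >= threshold]
    if len(survivors) < keep_at_least:
        survivors = sorted(paths, key=lambda p: p.confidence, reverse=True)[:keep_at_least]
    return survivors

if __name__ == "__main__":
    live = [
        Path("try factoring the quadratic ...", 0.82),
        Path("guess the answer is 7 ...", 0.12),
        Path("set up the equation again ...", 0.55),
    ]
    for p in prune_paths(live):
        print(f"{p.confidence:.2f}  {p.text}")
```

Calling such a routine after every few decoding steps is what allows most of the parallel paths to be abandoned early, which is where the reported compute savings would come from.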
OpenAI's Approach Questioned; Meta Researcher: It Simply Cannot Build Superintelligence
36Ke· 2025-06-20 12:00
Core Insights
- The pursuit of "superintelligence" represents a significant ambition among leading AI companies such as Meta, OpenAI, and Google DeepMind, with substantial investments being made in this direction [1][3][4]
- Sam Altman of OpenAI suggests that building superintelligence is primarily an engineering challenge, indicating a belief in a feasible path to achieving it [3][4]
- Meta AI researcher Jack Morris argues that the current approach of combining large language models (LLMs) with reinforcement learning (RL) may not be sufficient to construct superintelligence [1][2]

Group 1: Current Approaches and Challenges
- Morris outlines three potential methods for building superintelligence: purely supervised learning (SL), RL from human validators, and RL from automated validators [2]
- Integrating non-text data into models is believed not to enhance overall performance, as human-written text carries intrinsic value that raw sensory inputs do not [2][6]
- The concept of a "data wall" or "token crisis" is emerging, where the availability of text data for training LLMs is becoming a concern, leading to extensive efforts to scrape and transcribe data from various sources [8][19]

Group 2: Learning Algorithms and Their Implications
- The two primary learning methods identified as candidates for superintelligence are SL and RL, with SL being more stable and efficient for initial training [10][22]
- The hypothesis that superintelligence could emerge from SL alone is challenged by the limitations of current models, which may not exhibit human-level general intelligence despite excelling at specific tasks [15][16]
- The combination of SL and RL is proposed as a more viable path, leveraging human feedback or automated systems to refine model outputs [20][22][28]

Group 3: Future Directions and Speculations
- The potential for RL to transfer learning effectively across varied tasks remains uncertain, raising questions about whether this approach can scale to superintelligence [34]
- The competitive landscape among AI companies is likely to intensify as they seek to develop the most effective training environments for LLMs, potentially leading to breakthroughs in superintelligence [34]
LLM + RL Under Question: Deliberately Wrong Rewards Still Yield Big Gains on Math Benchmarks, and the AI Community Is in an Uproar
机器之心· 2025-05-28 08:09
Core Insights
- The article discusses a recent paper that challenges assumptions about what drives the effectiveness of reinforcement learning (RL) in training large language models (LLMs), showing that even false rewards can enhance performance [3][4][5]

Group 1: Findings on Reinforcement Learning
- The study reveals that using false rewards, including random and incorrect rewards, can significantly improve the performance of the Qwen2.5-Math-7B model on the MATH-500 benchmark, with random rewards improving scores by 21% and incorrect rewards by 25%, compared with a 28.8% improvement from true rewards [5][10]
- The research questions the traditional belief that high-quality supervision signals are essential for effective RL training, suggesting that even minimal or misleading signals can yield substantial improvements [7][19]

Group 2: Model-Specific Observations
- The effectiveness of RL with false rewards appears to be model-dependent, as other models such as Llama3 and OLMo2 did not show similar gains when trained with false rewards [16][17]
- The Qwen model demonstrated a distinctive ability to leverage code generation for mathematical reasoning, with a code-generation frequency of 65% before RL training that rose above 90% afterward [28][34]

Group 3: Implications for Future Research
- The findings indicate that future RL research should test these methods across diverse model families rather than relying on a single model's performance [25][49]
- Understanding the reasoning patterns already learned during pretraining is crucial for designing effective RL training strategies, as these patterns significantly influence downstream performance [50]
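To illustrate the experimental setup being described (not the paper's code), here is a minimal sketch of the three reward variants that could be plugged into the same RL loop: a ground-truth check, a deliberately inverted reward, and a reward that ignores the completion entirely. The function names and the coin-flip probability are illustrative assumptions.

```python
import random

def ground_truth_reward(answer: str, reference: str) -> float:
    """Reward 1.0 only when the model's final answer matches the reference."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def incorrect_reward(answer: str, reference: str) -> float:
    """Deliberately inverted reward: 1.0 exactly when the answer is wrong."""
    return 1.0 - ground_truth_reward(answer, reference)

def random_reward(answer: str, reference: str, p: float = 0.5) -> float:
    """Reward that ignores the answer entirely and flips a biased coin."""
    return 1.0 if random.random() < p else 0.0

if __name__ == "__main__":
    random.seed(0)
    ans, ref = "12", "12"
    for fn in (ground_truth_reward, incorrect_reward, random_reward):
        print(fn.__name__, fn(ans, ref))
```

Keeping the RL algorithm fixed and swapping only the reward function is what lets the comparison isolate how much of the gain actually comes from the quality of the supervision signal.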