Large Language Models: The Year 2025
Core Insights
- The large language model (LLM) industry has seen significant development by 2025, with companies like DeepSeek emerging as strong competitors through open-source strategies and advanced reasoning capabilities [1]
- Major players such as OpenAI, Google, Tencent, Alibaba, and ByteDance continue to compete on technology, applications, and ecosystems, leveraging their advantages in user acquisition and problem-solving [1]

Group 1: Company Developments
- DeepSeek has made notable advances with its DeepSeek-V3 model, which has 671 billion total parameters and excels at mathematical reasoning and code generation, competing with closed-source models such as GPT-4o [2]
- DeepSeek-V3.2 aims to balance reasoning capability against output length, while DeepSeek-V3.2-Speciale pushes the limits of reasoning ability [3]
- ByteDance's Doubao model has reached daily token usage exceeding 50 trillion, making it the leading AI model in China and the third largest globally [3]

Group 2: Technological Innovations
- Tencent's Hunyuan model has progressed from technical breakthroughs to comprehensive ecosystem applications, showing a clear path from technology to practical deployment [2]
- The Qwen2.5-VL-32B-Instruct model uses a unified Transformer architecture, improving cross-modal generation accuracy by over 40% [4]
- Zhipu AI has doubled its parameter scale from 5 trillion to 10 trillion, achieving a reasoning accuracy of 98.5%, nearing international standards [4]

Group 3: Future Trends
- The future of LLMs is characterized as "smarter, more vertical, and closer to daily life," transitioning from technical breakthroughs to deep applications across many fields [7]
- The rise of localized intelligent agents, such as Anthropic's Claude Code, allows low-latency interaction by running directly on user devices [8]
- The industry is expected to see significant advances in embodied-intelligence applications, which combine physical AI with large models, aligning with national development goals [9]
LLMs in 2025: 6 Key Insights
36Kr · 2025-12-23 11:39
Core Insights
- The report "2025 LLM Year in Review" by Andrej Karpathy highlights a significant paradigm shift in large language models (LLMs), from mere "probabilistic imitation" to "logical reasoning" [1][2]
- The driving force behind this transition is the maturation of Reinforcement Learning with Verifiable Rewards (RLVR), which encourages models to generate reasoning traces similar to human thought processes [1][2]
- Karpathy emphasizes that the potential of this new computational paradigm has yet to be fully explored, with current utilization estimated at less than 10% [2][15]

Technological Developments
- In 2025, RLVR emerged as the core new stage in the training stack for production-grade LLMs, allowing models to develop reasoning strategies autonomously by training in verifiable environments [4][5]
- Model training cycles lengthened significantly during the year, even though overall parameter scale remained largely unchanged [5]
- The introduction of the o1 model at the end of 2024 and the o3 model in early 2025 marked a qualitative leap in LLM capabilities [5]

Nature of Intelligence
- Karpathy argues that LLMs should be viewed as "summoned ghosts" rather than "evolving animals," indicating a fundamental difference in their intelligence compared with biological entities [2][6]
- LLM performance exhibits a "jagged" profile, excelling in advanced areas while struggling with basic common knowledge [2][8]

New Applications and Trends
- The rise of "vibe coding" and the practical trend toward localized intelligent agents indicate a shift to more user-centric AI applications [2][9]
- The emergence of tools like Cursor highlights a new application layer for LLMs, focused on context engineering and on optimizing model interactions for specific verticals [9]

User Interaction and Development
- Claude Code (CC) showcases the capabilities of LLM agents, emphasizing local deployment for better user interaction and access to private data [10][11]
- "Vibe coding" lets users create powerful programs using natural language, democratizing programming skills [12][13]

Future Outlook
- The report suggests the industry is on the brink of a transition from simulating human intelligence to achieving pure machine intelligence, with future competition focusing on efficient AI reasoning [2][15]
- The potential for innovation in the LLM space remains vast, with many ideas yet to be explored [15]
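The RLVR recipe described in the summary above reduces to a simple loop: sample several reasoning traces, check each final answer with a programmatic verifier, and reward only verifiable success. A minimal sketch in Python, where `toy_policy` and the arithmetic task are hypothetical stand-ins rather than anything from the report:

```python
import random

def verify(answer: str, ground_truth: str) -> float:
    """Verifiable reward: 1.0 iff the final answer matches exactly, else 0.0."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def rlvr_step(policy, prompt: str, ground_truth: str, n_samples: int = 4):
    """One RLVR step: sample several reasoning traces and score each with the
    verifier. A real trainer would then push the policy toward high-reward traces."""
    traces = [policy(prompt) for _ in range(n_samples)]
    rewards = [verify(t.rsplit("Answer:", 1)[-1], ground_truth) for t in traces]
    return traces, rewards

# Hypothetical toy policy standing in for an LLM: sometimes right, sometimes wrong.
def toy_policy(prompt: str) -> str:
    return f"Let me think step by step... Answer: {random.choice(['3', '4', '4'])}"

traces, rewards = rlvr_step(toy_policy, "What is 2 + 2?", ground_truth="4")
```

Because the reward is a deterministic program rather than a learned reward model, there is nothing for the policy to "hack" other than actually producing the right answer, which is why verifiable domains like math and code led the 2025 gains.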
LLMs in 2025: 6 Key Insights
Tencent Research Institute · 2025-12-23 08:33
Core Insights
- The article discusses a significant paradigm shift in large language models (LLMs) in 2025, from "probabilistic imitation" to "logical reasoning," driven by the maturation of Reinforcement Learning with Verifiable Rewards (RLVR) [2][3]
- The author emphasizes that less than 10% of LLMs' potential has been explored, indicating vast room for future development [3][25]

Group 1: Technological Advancements
- In 2025, RLVR emerged as the core new stage in LLM training, allowing models to generate reasoning traces autonomously by training in environments with verifiable rewards [7][8]
- The increase in model capability in 2025 came primarily from exploring and releasing the latent potential of RLVR, rather than from significant changes in parameter size [8][9]
- The introduction of the o1 model at the end of 2024 and the o3 model in early 2025 marked a qualitative leap in LLM capabilities [9]

Group 2: Nature of Intelligence
- The author argues that LLMs should be viewed as "summoned ghosts" rather than "evolving animals," highlighting a fundamental difference from biological intelligence [10][11]
- LLM performance exhibits a "jagged" profile, excelling in advanced fields while struggling with basic common knowledge [12][13]

Group 3: New Applications and Interfaces
- The emergence of Cursor represents a new application layer for LLMs, focused on context engineering and on optimizing prompt design for specific verticals [15]
- Claude Code (CC) demonstrates the core capabilities of LLM agents, operating locally on user devices and accessing private data [17][18]
- "Vibe coding" lets users create powerful programs using natural language, democratizing programming skills [20][21]

Group 4: Future Directions
- The article suggests that the future of LLMs will involve a shift toward visual and interactive interfaces, moving beyond text-based interaction [24]
- The potential for innovation in the LLM space remains vast, with many ideas yet to be explored, indicating continuous evolution in the industry [25]
LLMs in 2025: 6 Key Insights from OpenAI Founding Member and AI Heavyweight "AK"
36Kr · 2025-12-22 04:22
Core Insights
- Andrej Karpathy's report highlights a significant paradigm shift in large language models (LLMs) in 2025, from "probabilistic imitation" to "logical reasoning," driven by the maturation of Reinforcement Learning with Verifiable Rewards (RLVR) [1][2]
- The industry is at a critical juncture, transitioning from "simulating human intelligence" to "pure machine intelligence," with the focus shifting to making AI think efficiently rather than merely competing on computational power [2][4]

Group 1: Technological Advancements
- RLVR has emerged as the core new stage in LLM training, allowing models to generate reasoning traces autonomously by training in environments with verifiable rewards [4][5]
- 2025 saw a significant extension of LLM training cycles, with optimization for longer reasoning traces and increased "thinking time" producing qualitative leaps in model capability [5][6]

Group 2: Nature of Intelligence
- Karpathy argues that LLMs should be viewed as "summoned ghosts" rather than "evolving animals," indicating a fundamental difference between the nature of AI and biological intelligence [6][7]
- LLM performance exhibits a "jagged" profile, excelling in specialized areas while struggling with basic common knowledge, reflecting a unique intelligence structure [8]

Group 3: New Applications and Interfaces
- Applications like Cursor signify a new layer of LLM usage, focused on context engineering and on orchestrating multiple LLM calls for specific vertical domains [9][10]
- Claude Code (CC) demonstrates the potential of LLM agents to operate locally on user devices, accessing private data and providing a new paradigm of AI interaction [10][11]

Group 4: Programming and Development
- "Vibe coding" has gained traction, allowing individuals to create powerful programs using natural language, democratizing programming beyond trained professionals [11][12]
- The shift toward vibe coding is expected to transform the software development ecosystem, making coding more accessible and flexible for everyday users [12][13]

Group 5: Future Prospects
- Despite rapid advances, the industry has tapped less than 10% of LLMs' potential, indicating vast opportunities for future exploration and innovation [14][15]
- The report emphasizes that foundational work must continue alongside the rapid development of LLM technology, suggesting a sustained period of transformation ahead [14][15]
Karpathy's Year-End LLM List Draws Nearly Two Million Views: These Are Its Protagonists
机器之心 · 2025-12-21 03:01
Core Insights
- 2025 is a pivotal year in the evolution of large language models (LLMs), marked by significant paradigm shifts and advances in the field [2][36]
- The emergence of Reinforcement Learning with Verifiable Rewards (RLVR) is transforming LLM training, enhancing capabilities without necessarily increasing model size [10][11]
- The industry is seeing a new layer of LLM applications, exemplified by tools like Cursor, which organize and deploy LLM capabilities in specific verticals [16][17]

Group 1: Reinforcement Learning and Model Training
- RLVR lets models learn in verifiable environments, improving their problem-solving strategies through self-optimization [10]
- Most of 2025's capability gains came from extended RL training rather than increased model size, suggesting a new scaling law [11][12]
- OpenAI's o1 and o3 models exemplify the practical application of RLVR, showing a significant qualitative leap in performance [12]

Group 2: Understanding LLM Intelligence
- The industry is beginning to grasp the unique nature of LLM intelligence, which differs fundamentally from human intelligence and produces a jagged distribution of capabilities [14][15]
- "Vibe coding" allows non-engineers to create complex programs, democratizing programming and reshaping software development roles [25][29]
- Tools like Claude Code signal a shift toward LLM agents that operate locally, enhancing user interaction and productivity [19][22]

Group 3: User Interaction and GUI Development
- GUI applications like Google Gemini's "Nano Banana" indicate a trend toward more intuitive, visually engaging interactions with LLMs [31][34]
- The integration of text, images, and world knowledge within a single model represents a significant advance in how LLMs can communicate and operate [34]
- The industry is at the cusp of a new interaction paradigm, moving beyond traditional web-based AI to more integrated, user-friendly applications [23][30]

Group 4: Future Outlook
- The potential of LLMs remains largely untapped, with the industry only beginning to explore their capabilities [38][39]
- Continuous, rapid advances are expected, alongside recognition of the extensive work still required to fully realize the potential of LLM technology [40][41]
Karpathy's 2025 LLM Recap Takes Silicon Valley by Storm
量子位 · 2025-12-20 04:20
Core Insights
- The article discusses emerging AI trends for 2025, highlighting the transformative impact of large models and the view that only 10% of their potential has been realized so far [6][7]

Group 1: Key Predictions and Trends
- The introduction of RLVR (Reinforcement Learning with Verifiable Rewards) marks a new phase in training large models, allowing them to develop reasoning strategies autonomously [8][10]
- Large-model performance is expected to exhibit a "jagged" profile, with rapid bursts of capability as RLVR is adopted [18]
- Cursor represents a new application layer for large models, suggesting a shift toward more integrated, user-friendly AI applications [23][24]

Group 2: Innovations in AI Applications
- Claude Code is identified as a signature example of a large-model agent, capable of running locally on personal computers and using user-specific data [26][32]
- Vibe coding is expected to democratize programming, enabling non-professionals to create software through natural language [34][37]
- Nano Banana is highlighted as a groundbreaking model that integrates text generation, image generation, and world knowledge, setting a new standard for AI user interfaces and experience [40][43]
These Luminaries' Papers at Meta: One Fewer with Every One You Read
36Kr · 2025-11-17 09:52
Core Insights
- The article discusses a perplexing phenomenon in large-model reinforcement learning (RL) training: significant performance improvements occur despite minimal parameter changes [1][3]

Group 1: Research Findings
- The paper analyzes the training dynamics of Reinforcement Learning with Verifiable Rewards (RLVR), debunking the misconception that sparse parameter updates are merely superficial; instead, it reveals a fixed optimization bias inherent to RLVR [3][5]
- The research introduces a new framework, the Three-Gate Theory, which explains how RLVR parameter updates are steered toward specific parameter regions [5][7]

Group 2: Parameter Update Characteristics
- The study highlights a paradox: RL training yields large performance gains with sparse parameter updates, in contrast to the dense updates seen in supervised fine-tuning (SFT) [5][6]
- Update sparsity in RL training ranges from 36% to 92%, while SFT sparsity sits between 0.6% and 18.8%, a significant difference in update density [5][6]

Group 3: Three-Gate Theory Components
- The first gate, KL anchoring, ensures that RL updates do not deviate far from the model's original output style, keeping parameter drift small [8]
- The second gate, model geometry, indicates that RL updates prefer low-curvature directions in the optimization landscape, preserving the model's original weight structure [9]
- The third gate, precision, explains that the limited precision of bfloat16 can mask small updates in RL, producing the appearance of sparsity [11]

Group 4: Implications for Parameter-Efficient Fine-Tuning
- The findings suggest that many parameter-efficient fine-tuning (PEFT) methods from the SFT era do not transfer well to RLVR, particularly those built on sparse or low-rank priors [17]
- The study indicates that updating non-principal, low-amplitude weights aligns better with RLVR's optimization trajectory, while methods like PiSSA may add no benefit and can cause instability [17]
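The precision gate is easy to demonstrate without any ML framework: bfloat16 keeps only 8 of float32's 24 significand bits, so an update smaller than a weight's local resolution vanishes when the weight is stored back, which shows up as apparent sparsity. A sketch that emulates bf16 storage by zeroing the low 16 bits of the float32 encoding (truncation; real hardware rounds to nearest even, but the masking effect is the same):

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bfloat16 storage: keep the sign, the 8 exponent bits, and the
    top 7 mantissa bits of the float32 encoding, zeroing the rest."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

w = to_bf16(1.0)          # a weight stored in bfloat16
small_update = 1e-4       # a typical tiny RL update
w_after = to_bf16(w + small_update)
print(w_after == w)       # the update is swallowed: the weight looks untouched
```

Near 1.0 the bf16 spacing is 2^-7 ≈ 0.0078, so any update below that threshold rounds away entirely, while the same update applied in float32 (or in an SFT run with larger gradients) would register as a changed parameter.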
Boosting Reasoning Without Modifying the Model? ICLR Submission Proposes OTV, a New Test-Time Scaling Paradigm
量子位 · 2025-10-23 00:08
Core Insights
- The article discusses the challenges facing large language models, including hallucinations, logical errors, and reasoning flaws, which have prompted researchers to explore new ways to improve output reliability [1]
- A novel approach called One-Token Verification (OTV) is introduced; it lets models monitor their reasoning process in real time without altering the original model's structure or parameters [2]

Summary by Sections

Current Mainstream Paradigms
- LoRA fine-tuning is a popular parameter-efficient tuning method that avoids full-parameter training and is easy to deploy, but it often relies on detailed supervised data and can lead to "forgetting" effects [3]
- Quality screening of generated results can improve output credibility but is reactive: it cannot correct the model's reasoning in real time and offers no insight into the internal reasoning process [4]

Parallel Thinking Framework
- Parallel thinking lets language models generate multiple reasoning paths simultaneously and then filter them through a specific mechanism [5]
- OTV builds on this framework, focusing on selecting correct reasoning paths efficiently and cheaply rather than on generating more paths [5]

OTV Mechanism
- OTV employs an internal verifier that analyzes the reasoning process using a lightweight role vector implemented via LoRA, running in parallel with the original model [9]
- The internal verifier reads the Transformer's key-value cache (KV cache) to capture rich information about the model's internal dynamics during reasoning [9]
- A special token, the "Token of Truth" (ToT), is inserted during the verification phase to assess the correctness of the reasoning path [9]

Training and Efficiency
- OTV's internal verifier is lightweight; its training assigns heuristic pseudo-labels based on the correctness of the final answer [10]
- Training is highly parallelized, scoring all positions simultaneously, making it computationally comparable to ordinary LoRA fine-tuning [10]

Experimental Validation
- OTV was evaluated systematically on several open-source models, showing higher accuracy and a preference for shorter, more accurate reasoning paths than baseline methods [14]
- The results indicate that OTV can read internal reasoning state and output quality, significantly outperforming general methods that rely solely on output text [15]

Dynamic Control of Computational Cost
- OTV lets models control computational cost dynamically by eliminating low-quality paths in real time based on confidence scores, cutting computational load by nearly 90% while maintaining peak accuracy [17]

Future Prospects
- The OTV framework opens avenues for deeper integration with the original model, including a three-state scheme with an explicit "uncertain" state to enable selective prediction [25][26]
- The approach could also extend to other model architectures, optimizing KV-cache structures to further improve reasoning efficiency and representation utilization [26]
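The dynamic-pruning step described above can be sketched independently of any model: given a verifier confidence per reasoning path, keep only the top fraction and continue decoding the survivors. The confidence scores below are hypothetical stand-ins for OTV's internal-verifier outputs, not values from the paper:

```python
def prune_paths(paths, confidences, keep_frac=0.25):
    """Keep the highest-confidence reasoning paths; a real system would stop
    decoding the pruned paths, saving their remaining compute."""
    k = max(1, int(len(paths) * keep_frac))
    ranked = sorted(range(len(paths)), key=lambda i: confidences[i], reverse=True)
    return [paths[i] for i in ranked[:k]]

paths = ["path-A", "path-B", "path-C", "path-D"]
confidences = [0.91, 0.12, 0.55, 0.08]   # hypothetical verifier scores
survivors = prune_paths(paths, confidences)
```

Pruning three of four paths early is roughly where the reported near-90% compute reduction comes from: the cost of the dropped paths' remaining tokens is never paid, while the best candidate is still decoded to completion.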
OpenAI's Approach Questioned; Meta Researcher Says Superintelligence Cannot Be Built This Way
36Kr · 2025-06-20 12:00
Core Insights
- The pursuit of "superintelligence" is a major ambition among leading AI companies such as Meta, OpenAI, and Google DeepMind, with substantial investment flowing in that direction [1][3][4]
- OpenAI's Sam Altman suggests that building superintelligence is primarily an engineering challenge, implying a belief in a feasible path to it [3][4]
- Meta AI researcher Jack Morris argues that the current approach of combining large language models (LLMs) with reinforcement learning (RL) may not suffice to construct superintelligence [1][2]

Group 1: Current Approaches and Challenges
- Morris outlines three potential routes to superintelligence: purely supervised learning (SL), RL from human validators, and RL from automated validators [2]
- He argues that integrating non-text data into models does not improve overall performance, because human-written text carries an intrinsic value that raw sensory input does not [2][6]
- A "data wall" or "token crisis" is emerging: the supply of text data for training LLMs is becoming a concern, driving extensive efforts to scrape and transcribe data from every available source [8][19]

Group 2: Learning Algorithms and Their Implications
- The two primary learning methods considered for superintelligence are SL and RL, with SL the more stable and efficient for initial training [10][22]
- The hypothesis that superintelligence could emerge from SL alone is challenged by the limitations of current models, which excel at specific tasks without exhibiting human-level general intelligence [15][16]
- Combining SL and RL is proposed as the more viable path, using human feedback or automated systems to refine model outputs [20][22][28]

Group 3: Future Directions and Speculations
- Whether RL can transfer learning effectively across varied tasks remains uncertain, raising questions about whether this approach can scale to superintelligence [34]
- Competition among AI companies is likely to intensify as they race to build the most effective training environments for LLMs, potentially producing breakthroughs toward superintelligence [34]
LLM + RL Under Fire: Even Deliberately Wrong Rewards Significantly Boost Math Benchmarks, and the AI Community Is in Uproar
机器之心 · 2025-05-28 08:09
Core Insights
- The article discusses a recent paper that challenges assumptions about reinforcement learning (RL) for training large language models (LLMs), showing that even false rewards can enhance performance [3][4][5]

Group 1: Findings on Reinforcement Learning
- The study finds that false rewards, including random and deliberately incorrect rewards, significantly improve the Qwen2.5-Math-7B model on the MATH-500 benchmark: random rewards improve scores by 21% and incorrect rewards by 25%, versus a 28.8% improvement with true rewards [5][10]
- The research questions the traditional belief that high-quality supervision signals are essential for effective RL training, suggesting that even minimal or misleading signals can yield substantial gains [7][19]

Group 2: Model-Specific Observations
- The effect appears to be model-dependent: other models such as Llama3 and OLMo2 showed no comparable gains under false rewards [16][17]
- The Qwen model showed a distinctive ability to use code generation for mathematical reasoning, with a code-generation frequency of 65% before RL training rising to over 90% afterward [28][34]

Group 3: Implications for Future Research
- The findings indicate that future RL research should test these methods across diverse model families rather than relying on a single model's behavior [25][49]
- Understanding the reasoning patterns acquired during pre-training is crucial for designing effective RL training strategies, since those patterns strongly influence downstream performance [50]
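The spurious-reward ablations summarized above are simple to state as code: the reward functions below ignore or invert correctness, yet in the paper's Qwen2.5-Math-7B setup they still recovered most of the true reward's gain. The exact-match comparison here is a naive hypothetical checker, not the paper's evaluation code:

```python
import random

def true_reward(answer: str, ground_truth: str) -> float:
    """Ground-truth reward (the +28.8% MATH-500 condition in the paper)."""
    return 1.0 if answer == ground_truth else 0.0

def incorrect_reward(answer: str, ground_truth: str) -> float:
    """Deliberately inverted reward that pays for wrong answers (the +25% condition)."""
    return 1.0 - true_reward(answer, ground_truth)

def random_reward(answer: str, ground_truth: str, p: float = 0.5) -> float:
    """Coin-flip reward, independent of the answer (the +21% condition)."""
    return 1.0 if random.random() < p else 0.0
```

That a reward carrying zero (or negative) information about correctness still helps suggests the RL update is mostly amplifying behaviors the model already learned in pre-training, such as Qwen's code-assisted reasoning, rather than teaching anything new.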