RLVR (Reinforcement Learning with Verifiable Rewards)
Karpathy on AI's Ultimate Awakening in 2025: We Haven't Tapped Even 10% of LLM Potential
36Kr· 2025-12-22 00:29
Core Insights
- 2025 is anticipated to be a pivotal year in the history of artificial intelligence, marking a transition from "impressive" in 2023 through "confusion" in 2024 to "awakening" in 2025 [1][3]

Group 1: RLVR Revolution
- The traditional training pipeline for large language models (LLMs) involves three stages: pre-training, supervised fine-tuning, and reinforcement learning from human feedback (RLHF) [4][6]
- RLHF has been criticized for training models to "appear to reason" rather than genuinely reason, leading to issues like "sycophancy," where models produce plausible but incorrect outputs [6][7]
- The emergence of RLVR (Reinforcement Learning with Verifiable Rewards) marks a new phase in which models are trained on objective outcomes rather than human preference judgments, allowing for a more robust learning process (a minimal sketch of the idea follows this summary) [7][12]
- RLVR enables models to explore multiple reasoning paths and self-verify their outputs, leading to the development of reasoning capabilities without explicit instruction [18][19]
- The shift in focus from training time to inference time allows models to enhance their intelligence by spending more compute on hard problems, much as a student takes longer to solve difficult questions [21][23]

Group 2: Philosophical Divide
- A philosophical debate is presented over whether AI is creating new "animals" or "ghosts," the latter referring to LLMs that lack continuous consciousness and are instead statistical constructs of human language [24][32]
- Rich Sutton's "Bitter Lesson" holds that methods leveraging ever-growing computational power ultimately outperform those built on hand-coded human knowledge [27][28]
- Current AI models are seen as "ghosts" that lack a continuous self and instead mirror human language, producing an "uncanny valley" effect in interactions [33][35]

Group 3: Vibe Coding
- Vibe coding represents a shift in programming paradigms where developers focus on intent rather than code details, letting AI generate code from natural-language descriptions [40][44]
- Projects like MenuGen demonstrate the potential of vibe coding, where even experienced programmers can build applications without writing traditional code [44][45]
- The competition between AI programming tools such as Cursor and Claude Code highlights the evolving landscape of AI-assisted development, with each offering different levels of integration and autonomy [45][46]

Group 4: Paradigm Shift
- The introduction of Google's Gemini "Nano Banana" signals a major paradigm shift in computing, suggesting that LLMs will redefine user interfaces beyond traditional text-based interaction [47][49]
- Users' preference for visual and spatial information over text indicates that LLMs need to evolve toward more engaging communication formats [49][50]
- AI's "jagged" intelligence, excelling in some areas while failing in others, reflects the uneven distribution of training data and the uneven nature of AI capabilities [51][52]

Group 5: Future Outlook
- 2025 is positioned as an exciting yet unpredictable time for LLMs, with significant advances possible and much of their capability still untapped [53][55]
- The belief in rapid progress, alongside how much work remains, points to a dynamic and evolving landscape in AI research and application [57][58]
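To make Group 1's contrast with RLHF concrete, here is a minimal sketch of a verifiable reward: the training signal comes from objectively checking the model's final answer, not from a learned human-preference model. The "Answer: <value>" output format and the helper names are illustrative assumptions, not anything described in the article.

```python
# A minimal sketch of the "verifiable reward" idea behind RLVR, assuming a
# hypothetical output format in which the model ends with "Answer: <value>".
# Neither the format nor the helper names come from the article.
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Score a rollout by objectively checking its final answer."""
    match = re.search(r"Answer:\s*(.+)", completion)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

if __name__ == "__main__":
    # During RLVR training, a group of sampled reasoning paths for one math
    # problem is scored against the known answer; no preference model is
    # involved, so there is nothing for the policy to flatter.
    rollouts = [
        "3 * 7 = 24, so Answer: 24",  # wrong answer -> reward 0
        "3 * 7 = 21, so Answer: 21",  # correct answer -> reward 1
    ]
    print([verifiable_reward(r, "21") for r in rollouts])  # [0.0, 1.0]
```

Because the checker is binary and automatic, producing a merely plausible-sounding answer earns nothing, which is exactly the property that distinguishes this setup from RLHF's learned reward.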
Rejecting "Entropy Collapse" and "Entropy Explosion": This Study Teaches Large Models "Precise Exploration" and Sends Reasoning Scores Soaring
量子位· 2025-10-13 08:47
Core Insights
- The article discusses advances in large language models (LLMs) driven by RLVR (Reinforcement Learning with Verifiable Rewards), which has produced significant breakthroughs in mathematical, coding, and scientific reasoning tasks since 2024 [1][2]

Group 1: Challenges in RLVR Training
- RLVR faces a critical bottleneck known as "exploration imbalance": exploration can be too limited, leading to entropy collapse, or too uncontrolled, resulting in entropy explosion [2][9]
- Traditional entropy regularization encourages exploration but can drive either rapid convergence to a deterministic policy or chaotic outputs from excessive uncertainty [6][10]

Group 2: Proposed Solution - SIREN
- The research team introduced a selective entropy regularization method (SIREN) built on three mechanisms: bounding the exploration range, focusing on key decision points, and stabilizing the training process (a sketch of all three follows this summary) [14][18]
- SIREN limits entropy calculations to a core set of high-probability tokens, ensuring that exploration occurs only among semantically reasonable candidates [14][15]
- It identifies key decision points in the generated sequence, positions where entropy is significantly higher than average, and concentrates exploration incentives there [16]
- It adjusts entropy toward a target value so that it stays within a reasonable range, preventing training instability [17]

Group 3: Experimental Validation
- Experiments show that SIREN significantly improves performance across models and datasets, reaching an average majority-vote accuracy (maj@k) of 54.6% on Qwen2.5-Math-7B and surpassing the strongest baseline by 4.8% [22][24]
- The effective exploration enabled by SIREN yields a qualitative change in performance compared with traditional entropy regularization [25][32]
- SIREN maintains diversity in answers and avoids entropy collapse, contributing to smoother and more controllable training [28][30]

Group 4: Future Implications
- The study emphasizes that stable, controllable, and efficient exploration is key to releasing the potential of large models and overcoming performance bottlenecks [35]
- The proposed selective exploration-control mechanism offers a feasible path for refining exploration strategies in future reasoning-model training paradigms [35]
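The three mechanisms in Group 2 map naturally onto a per-token computation, sketched below against hypothetical logits of shape (seq_len, vocab_size). The top-k cutoff, the mean-plus-std rule for spotting decision points, and the squared deviation from a target entropy are my assumptions about the form based on this summary, not the paper's actual formulation.

```python
# A rough sketch of SIREN's three mechanisms as summarized above. All
# thresholds and functional forms are illustrative assumptions.
import torch

def selective_entropy_bonus(
    logits: torch.Tensor,          # (seq_len, vocab_size) per-token logits
    top_k: int = 20,               # mechanism 1: entropy over top-k tokens only
    entropy_target: float = 0.5,   # mechanism 3: keep entropy near a target
) -> torch.Tensor:
    # Mechanism 1: renormalize over the top-k candidates so exploration is
    # measured only among semantically plausible tokens.
    topk_logits, _ = logits.topk(top_k, dim=-1)
    probs = torch.softmax(topk_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # (seq_len,)

    # Mechanism 2: keep the bonus only at "key decision points", positions
    # whose entropy is well above the sequence average.
    threshold = entropy.mean() + entropy.std()
    mask = (entropy > threshold).float()

    # Mechanism 3: penalize squared deviation from a target rather than
    # maximizing entropy outright, so it can neither collapse nor explode.
    bonus = -(entropy - entropy_target).pow(2) * mask
    return bonus.mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    fake_logits = torch.randn(16, 1000)  # hypothetical 16-token rollout
    print(selective_entropy_bonus(fake_logits).item())
```

The design intent, as the summary describes it, is that a plain entropy bonus pushes entropy in one direction everywhere, whereas this selective form only nudges a handful of genuinely uncertain positions toward a set point.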
Fudan, Tongji, CUHK and Others Release a Major Survey: Reinforcement Learning Across the Full Lifecycle of Large Language Models
机器之心· 2025-09-30 23:49
Core Insights
- The article discusses significant advances in reinforcement learning (RL) techniques that enhance the capabilities of large language models (LLMs), particularly in understanding human intent and following user instructions [2][3]
- A comprehensive survey titled "Reinforcement Learning Meets Large Language Models," conducted by researchers from top institutions, summarizes the role of RL throughout the entire lifecycle of LLMs [2][3]

Summary by Sections

Overview of Reinforcement Learning in LLMs
- The survey details how RL is applied at each stage of an LLM's life, including pre-training, alignment fine-tuning, and reinforced reasoning [3][6]
- It organizes existing datasets, evaluation benchmarks, and mainstream open-source tools and training frameworks relevant to RL fine-tuning, providing a clear reference for future research [3][6]

Lifecycle of LLMs
- The survey systematically covers the complete application lifecycle of RL in LLMs, detailing the objectives, methods, and challenges at each stage from pre-training onward [11][12]
- A classification overview of how RL operates within LLMs is presented, highlighting the interconnections between stages [5][6]

Focus on Verifiable Rewards
- The survey emphasizes Reinforcement Learning with Verifiable Rewards (RLVR), summarizing its applications in improving the stability and accuracy of LLM reasoning [7][9]
- It discusses how RLVR optimizes the reasoning process and improves adaptability to complex tasks through automatically verifiable reward mechanisms (a sketch of one common recipe follows this summary) [7][9]

Key Contributions
- The survey makes three main contributions: a full-lifecycle overview of RL applications in LLMs, a focus on advanced RLVR techniques, and a consolidation of the key research resources needed for experiments and evaluations [9][11]
- It provides valuable references for researchers interested in exploring RL in the context of LLMs [11][12]

Challenges and Future Directions
- Despite significant progress, large-scale RL for LLMs still faces scalability and training-stability challenges; it remains computationally intensive and often unstable [12][13]
- Reward design and credit assignment, particularly over long reasoning horizons, remain difficult for model learning [12][13]
- The survey highlights the need for standardized datasets and evaluation benchmarks to facilitate comparison and validation of RL fine-tuning methods [12][13]
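As one concrete example of how an automatically verifiable reward becomes a policy-gradient signal, here is a minimal sketch of group-relative advantage normalization in the style of GRPO, one common RLVR recipe. GRPO is not singled out in this summary, and the binary rewards and function names are illustrative assumptions rather than a method taken from the survey.

```python
# A minimal sketch of turning verifiable rewards into a learning signal via
# group-relative advantages (GRPO-style); names and rewards are illustrative.
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize verifiable rewards within a group of rollouts for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)  # no signal if every rollout ties
    return [(r - mean) / std for r in rewards]

if __name__ == "__main__":
    # Hypothetical binary rewards from an automatic checker over 4 rollouts
    # sampled for the same prompt.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))  # verified rollouts score > 0
```

The appeal of this family of recipes, and part of why the survey's credit-assignment concerns matter, is that the advantage is computed per prompt from checkable outcomes alone; the open difficulty is attributing that single sequence-level score to individual reasoning steps.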