RLVR (Reinforcement Learning with Verifiable Rewards)
Karpathy's 2025 Ultimate AI Awakening: We Haven't Tapped Even 10% of LLMs' Potential
36Kr· 2025-12-22 00:29
2025 is destined to be remembered as a landmark year in the history of artificial intelligence. If 2023 was the year of "astonishment" (the sudden arrival of ChatGPT) and 2024 the year of "confusion" (exploration amid the hopes of deploying large models), then in Andrej Karpathy's telling, 2025 is the year of "awakening." Karpathy has long been one of the AI world's most prominent evangelists. His year-end review is not just a technical retrospective; it reads like a miniature chronicle of how LLMs evolved from "parrots imitating humans" into "summoned ghosts of reason." With a sharp eye, he captures the core of AI's evolution: the rise of RLVR (Reinforcement Learning with Verifiable Rewards), the spread of Vibe Coding, and a thought-provoking philosophical metaphor: in creating AI, are we manufacturing a new species or summoning a ghost? This piece unpacks each paradigm shift Karpathy names, cuts through the fog of jargon to the essence of how intelligence is evolving, and presents a real, wild, and unevenly distributed AI in 2025. Chapter 1: The RLVR Revolution, from "pleasing humans" to "pursuing truth." Before 2025, training a large language model (LLM) typically involved three steps: 1. Pre-training: have the model read the entire internet and learn to predict the next token; this is the "erudition" stage ...
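The blurb is truncated, but its core technical point is that RLVR replaces a learned human-preference reward with a programmatic check. Below is a minimal, hedged Python sketch of such a verifiable reward for an exact-answer math task; the function name and the "Answer:" output format are assumptions for illustration, not Karpathy's or any specific system's implementation.

```python
def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Toy RLVR-style reward: 1.0 if the extracted final answer matches the
    reference exactly, else 0.0. No human preference model is involved."""
    # Assumption: the model is prompted to finish with a line like "Answer: 42".
    for line in reversed(model_output.strip().splitlines()):
        if line.lower().startswith("answer:"):
            prediction = line.split(":", 1)[1].strip()
            return 1.0 if prediction == reference_answer.strip() else 0.0
    return 0.0  # no parsable final answer -> no reward
```

Because the signal is binary and automatically checkable, it can be scaled to millions of rollouts without human raters, which is what distinguishes RLVR from preference-based RLHF.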
No More "Entropy Collapse" or "Entropy Explosion": This Research Teaches Large Models "Precise Exploration," and Reasoning Scores Soar
量子位· 2025-10-13 08:47
Core Insights
- The article discusses advances in large language models (LLMs) driven by RLVR (Reinforcement Learning with Verifiable Rewards), which has led to significant breakthroughs in mathematical, coding, and scientific reasoning tasks since 2024 [1][2].

Group 1: Challenges in RLVR Training
- RLVR faces a critical bottleneck known as "exploration imbalance": exploration can either be too limited, leading to entropy collapse, or too uncontrolled, resulting in entropy explosion [2][9].
- Traditional entropy regularization encourages exploration but can produce either rapid convergence to a deterministic policy or chaotic outputs caused by excessive uncertainty [6][10].

Group 2: Proposed Solution - SIREN
- The research team introduced a Selective Entropy Regularization method (SIREN) built on three mechanisms: restricting the exploration range, focusing on key decision points, and stabilizing the training process (a minimal sketch follows this summary) [14][18].
- SIREN limits entropy calculations to a core set of high-probability tokens, ensuring that exploration occurs only among semantically reasonable candidates [14][15].
- It identifies key decision points in the generated sequence where entropy is significantly higher than average, concentrating exploration incentives on these critical positions [16].
- It regulates entropy toward a target value so that it stays within a reasonable range, preventing training instability [17].

Group 3: Experimental Validation
- Experiments show that SIREN significantly improves performance across models and datasets, reaching an average majority-vote accuracy (maj@k) of 54.6% on Qwen2.5-Math-7B and surpassing the strongest baseline by 4.8 percentage points [22][24].
- The effective exploration enabled by SIREN yields a fundamental improvement over traditional entropy regularization methods [25][32].
- SIREN maintains answer diversity without collapsing into confusion, contributing to a smoother and more controllable training process [28][30].

Group 4: Future Implications
- The study emphasizes that stable, controllable, and efficient exploration is key to unlocking the potential of large models and overcoming performance bottlenecks [35].
- The proposed selective exploration control mechanism offers a feasible recipe for refining exploration strategies in future reasoning-model training paradigms [35].
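To make the three mechanisms concrete, here is a minimal PyTorch sketch of a selective entropy regularizer in the spirit of SIREN. It is not the paper's released code: the function name, the top_k / quantile / target_entropy values, and the squared-distance penalty toward the entropy target are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def selective_entropy_bonus(logits, top_k=20, quantile=0.8,
                            target_entropy=1.0, coef=1e-3):
    """Illustrative SIREN-style selective entropy regularizer.

    logits: [seq_len, vocab] pre-softmax scores for one sampled response.
    Returns a scalar bonus to add to the RL objective (assumed form).
    """
    # 1) Restrict the entropy computation to the top-k candidate tokens per step,
    #    so exploration pressure stays within semantically plausible candidates.
    top_logits, _ = logits.topk(top_k, dim=-1)                        # [seq_len, top_k]
    probs = F.softmax(top_logits, dim=-1)
    token_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)   # [seq_len]

    # 2) Focus on "key decision points": steps whose entropy is well above average.
    threshold = torch.quantile(token_entropy, quantile)
    key_steps = token_entropy >= threshold

    # 3) Pull entropy toward a target instead of maximizing it without bound,
    #    which is what guards against entropy explosion.
    selected = token_entropy[key_steps]
    return -coef * (selected - target_entropy).pow(2).mean()
```

The bonus would simply be added to the policy-gradient objective; penalizing squared distance from a target entropy, rather than maximizing entropy outright, is what separates this from vanilla entropy regularization.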
Fudan, Tongji, CUHK and Others Release a Major Comprehensive Survey of Reinforcement Learning Across the Full Lifecycle of Large Language Models
机器之心· 2025-09-30 23:49
Core Insights
- The article discusses significant advances in reinforcement learning (RL) techniques that enhance the capabilities of large language models (LLMs), particularly in understanding human intent and following user instructions [2][3].
- A comprehensive survey titled "Reinforcement Learning Meets Large Language Models," conducted by researchers from top institutions, summarizes the role of RL throughout the entire lifecycle of LLMs [2][3].

Summary by Sections

Overview of Reinforcement Learning in LLMs
- The survey details how RL is applied at each stage of LLM development, including pre-training, alignment fine-tuning, and reinforced reasoning [3][6].
- It organizes existing datasets, evaluation benchmarks, and mainstream open-source tools and training frameworks relevant to RL fine-tuning, providing a clear reference for future research [3][6].

Lifecycle of LLMs
- The survey systematically covers the complete application lifecycle of RL in LLMs, detailing the objectives, methods, and challenges at each stage from pre-training to reinforcement [11][12].
- A classification overview of how RL operates in LLMs is presented, highlighting the interconnections between the stages [5][6].

Focus on Verifiable Rewards
- The survey emphasizes Reinforcement Learning with Verifiable Rewards (RLVR), summarizing its applications in improving the stability and accuracy of LLM reasoning [7][9].
- It discusses how RLVR optimizes the reasoning process and improves the model's adaptability to complex tasks through automatically verifiable reward mechanisms [7][9].

Key Contributions
- The survey makes three main contributions: a comprehensive lifecycle overview of RL applications in LLMs, a focus on advanced RLVR techniques, and a curated collection of the key research resources needed for experiments and evaluation [9][11].
- It provides a valuable reference for researchers interested in exploring RL in the context of LLMs [11][12].

Challenges and Future Directions
- Despite significant progress, scalability and training stability remain open problems for large-scale RL on LLMs, which is still computationally intensive and often unstable [12][13].
- Reward design and credit assignment, particularly over long-horizon reasoning, make learning difficult (a minimal illustration of group-based credit assignment follows this summary) [12][13].
- The survey highlights the need for standardized datasets and evaluation benchmarks to facilitate comparison and validation of RL fine-tuning methods [12][13].
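The credit-assignment challenge raised above has a common partial workaround in current RLVR pipelines: group-relative advantages in the style of GRPO, where each response's single verifiable reward is normalized against the other samples for the same prompt. The sketch below illustrates that general idea under assumed names and shapes; it is not a method prescribed by the survey.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Given verifiable rewards for a group of responses sampled from the same
    prompt (shape [group_size]), return per-response advantages by normalizing
    against the group mean and std, GRPO-style. Every token of a response then
    shares that one advantage, which is exactly the coarse credit assignment
    that becomes problematic for long reasoning chains."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled solutions to one math problem, 2 verified correct.
adv = group_relative_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))
# adv ≈ [0.87, -0.87, 0.87, -0.87]: correct samples are reinforced, wrong ones suppressed.
```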