Reinforcement Learning with Verifiable Rewards (RLVR)
Karpathy's 2025 Year in Review: o3 Is the Real Inflection Point, and Cursor Proved the Application Layer Is Thicker Than We Thought
Founder Park· 2025-12-20 08:59
Reposted from 「赛博禅心」. Andrej Karpathy published a new blog post on X reviewing large-model developments in 2025. In it, Karpathy writes that 2025 was an exciting year for LLMs: they are emerging as an entirely new form of intelligence, at once far smarter and far dumber than we expected, and even at today's capability levels the industry has realized far less than 10% of their potential.

At the same time, there are too many ideas worth trying, and conceptually the field remains wide open. "As I mentioned on the Dwarkesh podcast earlier this year, I believe we will continue to see rapid and sustained progress, but there is still a great deal of work to do. Buckle up. Below are the 'paradigm shifts' I personally find most worth watching, changes that reshaped the industry landscape and left a deep conceptual impression on me." TLDR: 01 Reinforcement Learning with Verifiable Rewards (RLVR) became the new training workhorse: in 2025, RLVR became the new mainline stage of LLM training; ...
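To make the RLVR idea concrete: instead of a learned reward model, the reward comes from a programmatic check (a unit test, a math-answer verifier, a compiler). Below is a minimal, illustrative Python sketch of one policy-gradient step with such a verifiable reward. The model name, the boxed-answer verifier, and the REINFORCE-style update with group-mean centring are assumptions for illustration, not code from Karpathy's post.

```python
# Minimal RLVR step: sample completions, score them with a programmatic
# verifier, and push up the log-probs of verified completions.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"  # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def verify_answer(completion: str, gold: str) -> float:
    """Verifiable reward: 1.0 if the boxed answer matches the reference, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == gold else 0.0

def rlvr_step(prompt: str, gold: str, num_samples: int = 4) -> float:
    enc = tokenizer(prompt, return_tensors="pt")
    prompt_len = enc.input_ids.shape[1]
    # Sample several completions from the current policy.
    samples = model.generate(**enc, do_sample=True, max_new_tokens=256,
                             num_return_sequences=num_samples)
    rewards = torch.tensor([verify_answer(tokenizer.decode(s[prompt_len:]), gold)
                            for s in samples])
    advantages = rewards - rewards.mean()  # group-mean baseline, GRPO-like centring
    # Policy gradient: weight completion log-probs by their advantage
    # (eos/padding tokens are not masked here, for brevity).
    logits = model(samples).logits[:, prompt_len - 1:-1, :]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, samples[:, prompt_len:].unsqueeze(-1)).squeeze(-1)
    loss = -(advantages.unsqueeze(1) * token_lp).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return rewards.mean().item()
```

A real run would add KL regularization, masking of padded tokens, and sharded optimization; the point of the sketch is only that the reward is a deterministic check rather than a learned model.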
Making Robots "Not Only Think, but Also Act Accurately": VLA-R1 Brings "Reasoning + Action" into the Real World
机器之心· 2025-10-25 05:14
Core Insights
- The article discusses the VLA-R1 model, which enhances reasoning in Vision-Language-Action (VLA) models by integrating chain-of-thought (CoT) supervision with reinforcement learning (RL) to improve both reasoning quality and execution accuracy [4][5].

Group 1: VLA-R1 Overview
- VLA-R1 is a foundational model that emphasizes "reasoning first, then executing" [4].
- It combines CoT supervision with verifiable rewards from RL to optimize the reasoning and execution processes [4][5].

Group 2: Key Innovations
- Two-stage training approach: the model first undergoes supervised fine-tuning (SFT) with explicit CoT supervision, followed by reinforcement learning based on GRPO to stabilize the transition from reasoning to action [6][8].
- Three types of verifiable rewards (RLVR) are introduced to ensure accurate perception, trajectory execution, and structured output [9][11].
- The VLA-CoT data engine generates a structured dataset of 13,000 visual-language-action samples to provide high-quality supervision signals for SFT [12][19].

Group 3: Experimental Results
- VLA-R1 was evaluated across four levels: in-domain testing, out-of-domain testing, simulation platforms, and real robot experiments [16][17].
- In the in-domain benchmark, VLA-R1 achieved a perception IoU of 36.51, improving by 17.78% over the baseline [22].
- In real robot experiments, VLA-R1 demonstrated a success rate of 62.5% for affordance perception and 75% for trajectory execution under various environmental complexities [26].

Group 4: Applications
- VLA-R1 is applicable in home automation tasks, such as object retrieval and organization in cluttered environments, by effectively reasoning through similar targets and multiple container options [34].
- It can also be utilized in warehouse picking and light industrial assembly processes, where it clarifies the relationships between parts, tools, and containers [34].
- The model's structured output format is suitable for educational demonstrations and automated assessments, allowing for easy evaluation of reasoning and execution steps [34].
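The three verifiable-reward terms listed under Key Innovations (perception, trajectory execution, structured output) can be sketched as simple programmatic checks. The function names, thresholds, tag format, and equal weighting below are assumptions made for illustration; they are not the authors' released implementation.

```python
# Sketch of three verifiable-reward signals in the spirit of VLA-R1's RLVR setup.
from __future__ import annotations
import math
import re

def iou_reward(pred_box: tuple[float, float, float, float],
               gold_box: tuple[float, float, float, float]) -> float:
    """Perception reward: IoU between predicted and reference affordance boxes."""
    x1, y1 = max(pred_box[0], gold_box[0]), max(pred_box[1], gold_box[1])
    x2, y2 = min(pred_box[2], gold_box[2]), min(pred_box[3], gold_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gold_box) - inter
    return inter / union if union > 0 else 0.0

def trajectory_reward(pred: list[tuple[float, float]],
                      gold: list[tuple[float, float]],
                      tol: float = 20.0) -> float:
    """Trajectory reward: 1 when the mean point error is within `tol` pixels,
    decaying linearly to 0 beyond it (a stand-in for the paper's actual metric)."""
    n = min(len(pred), len(gold))
    if n == 0:
        return 0.0
    err = sum(math.dist(pred[i], gold[i]) for i in range(n)) / n
    return max(0.0, 1.0 - err / tol)

def format_reward(text: str) -> float:
    """Structured-output reward: the response must contain a reasoning block and an action block."""
    ok = re.search(r"<think>.+?</think>", text, re.S) and re.search(r"<answer>.+?</answer>", text, re.S)
    return 1.0 if ok else 0.0

def total_reward(text, pred_box, gold_box, pred_traj, gold_traj) -> float:
    # Equal weighting is an assumption; the paper may combine the terms differently.
    return (iou_reward(pred_box, gold_box)
            + trajectory_reward(pred_traj, gold_traj)
            + format_reward(text)) / 3.0
```

Because every term is computed from the model's output and the ground truth alone, the GRPO stage can score each sampled rollout without a learned reward model, which is what makes the rewards "verifiable."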
Supervised Learning Isn't Dead: One Problem, Five GPU-Hours, and It Takes Off! Chinese Researchers' New Method Unlocks LLM Reasoning with 20x Training Efficiency
量子位· 2025-08-04 07:00
Core Viewpoint
- The article discusses the breakthrough of One-Shot Critique Fine-Tuning (One-Shot CFT) in enhancing reasoning capabilities of large language models (LLMs) with minimal data and computational resources, outperforming traditional reinforcement learning (RL) methods and small-scale supervised fine-tuning (SFT) approaches [1][3][14].

Group 1: One-Shot CFT Methodology
- One-Shot CFT is a new method that allows models to learn reasoning by analyzing the quality of answers rather than merely imitating them, thus providing a deeper learning signal [3][12].
- The process involves selecting a representative task, generating multiple answers using various models, and then having a more powerful model critique these answers, which serves as the supervision signal for training [4][5].
- The entire training process requires only one question, multiple answers, and critiques, taking approximately 5 GPU hours, significantly less than RL methods [5][14].

Group 2: Performance and Results
- In experiments, Qwen2.5-Math-7B achieved a 15% accuracy increase after One-Shot CFT fine-tuning on a single question, surpassing both RL and full supervised fine-tuning models that used tens of thousands of training samples [9][10].
- The method demonstrated strong performance across various mathematical and logical reasoning tasks, with accuracy improvements ranging from 10% to 16% in specific sub-tasks [10][11].
- One-Shot CFT showed stability and reproducibility across different tasks and model configurations, indicating its robustness [11][13].

Group 3: Advantages of One-Shot CFT
- The method emphasizes critical learning, allowing models to understand why answers are correct or incorrect, which enhances the depth of learning compared to traditional SFT [12].
- It introduces multi-perspective inputs by generating multiple answers and critiques for a single task, closely mimicking human learning processes [12].
- The training signals from critiques are highly generalizable, reducing the risk of overfitting and allowing for easier transfer to new tasks [12].

Group 4: Accessibility and Practical Implications
- One-Shot CFT's low computational cost makes it accessible for individual researchers, resource-limited labs, and startups, providing a cost-effective solution for enhancing reasoning capabilities [14][15].
- The entire process is open-source, including training scripts, model parameters, and datasets, which significantly lowers the barrier for replication and experimentation [17].
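The Group 1 recipe (one question, many candidate answers, critiques from a stronger model as the supervision signal) can be sketched as a small data-construction step before ordinary SFT. The helper names, prompt template, and dataclass below are hypothetical placeholders; the authors' open-source scripts define the real pipeline.

```python
# Assemble a One-Shot CFT training set: one problem, many candidate solutions,
# one critique per solution, written by a stronger "teacher" model.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CFTExample:
    prompt: str   # problem + one candidate solution
    target: str   # critique produced by the teacher model

def build_one_shot_cft_dataset(question: str,
                               candidate_solutions: List[str],
                               critique_fn: Callable[[str, str], str]) -> List[CFTExample]:
    """`candidate_solutions` would come from sampling several open models on the same
    problem; `critique_fn` stands in for a stronger model (e.g. an API call) that
    judges whether each solution is correct and explains why."""
    examples = []
    for solution in candidate_solutions:
        prompt = (
            "Problem:\n" + question +
            "\n\nCandidate solution:\n" + solution +
            "\n\nCritique this solution step by step and state whether it is correct."
        )
        critique = critique_fn(question, solution)  # supervision signal
        examples.append(CFTExample(prompt=prompt, target=critique))
    return examples
```

The resulting (prompt, target) pairs are then used for standard supervised fine-tuning of the student model (roughly 5 GPU-hours per the article), after which the student is evaluated on held-out math and logic benchmarks.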