Reinforcement Learning with Verifiable Rewards (RLVR)
VLA+RL or pure RL? Tracing the development of reinforcement learning through 200+ papers
具身智能之心· 2025-08-18 00:07
Core Insights
- The article provides a comprehensive analysis of the intersection of reinforcement learning (RL) and visual intelligence, focusing on the evolution of strategies and key research themes in visual reinforcement learning [5][17][25].

Group 1: Key Themes in Visual Reinforcement Learning
- The article categorizes over 200 representative studies into four main pillars: multimodal large language models, visual generation, unified model frameworks, and vision-language-action models [5][17].
- Each pillar is examined for algorithm design, reward engineering, and benchmark progress, highlighting trends and open challenges in the field [5][17][25].

Group 2: Reinforcement Learning Techniques
- Various reinforcement learning techniques are discussed, including Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which are used to improve stability and efficiency during training [15][16]; a minimal sketch of the GRPO idea follows this summary.
- The article emphasizes the importance of reward models, such as those based on human feedback and verifiable rewards, in guiding the training of visual reinforcement learning agents [10][12][21].

Group 3: Applications in Visual and Video Reasoning
- The article outlines applications of reinforcement learning in visual reasoning tasks, including 2D and 3D perception, image reasoning, and video reasoning, showing how these methods improve task performance [18][19][20].
- Specific studies are highlighted that use reinforcement learning to strengthen capabilities in complex visual tasks such as object detection and spatial reasoning [18][19][20].

Group 4: Evaluation Metrics and Benchmarks
- The article discusses the need for new evaluation metrics tailored to visual reinforcement learning with large models, combining traditional metrics with preference-based assessments [31][35].
- It surveys benchmarks that support training and evaluation in the visual domain, emphasizing the role of human preference data in shaping reward models [40][41].

Group 5: Future Directions and Challenges
- The article identifies key challenges in visual reinforcement learning, such as balancing depth and efficiency in reasoning, and suggests future research directions to address them [43][44].
- It highlights the importance of developing adaptive strategies and hierarchical reinforcement learning approaches to improve the performance of vision-language-action agents [43][44].
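For context on the GRPO technique mentioned above, here is a minimal sketch of its core idea as commonly described in the RL literature: instead of training a value critic, the rewards of a group of responses sampled for the same prompt are normalized within the group to produce advantages. The function and variable names are illustrative, not taken from any specific codebase.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sampled response's reward
    against the mean and standard deviation of its own group (all responses
    to the same prompt). This replaces the learned value critic used in PPO."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy example: four sampled answers to one prompt, scored by a verifiable
# reward (1.0 = final answer correct, 0.0 = incorrect).
group_rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(group_rewards))  # correct answers get positive advantages
```

The policy is then updated with a PPO-style clipped objective using these group-relative advantages, which is where the stability and efficiency benefits cited above come from.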
A new survey of visual reinforcement learning: a field-wide overview (NUS, Zhejiang University, CUHK)
自动驾驶之心· 2025-08-16 00:03
Figure 1: Timeline of representative visual reinforcement learning models. The figure chronologically surveys key Visual RL models from 2023 to 2025 and groups them into four domains: Multimodal LLMs, Visual Generation, Unified Models, and Vision-Language-Action (VLA) Models.

In the world of large language models (LLMs), reinforcement learning (RL), and in particular reinforcement learning from human feedback (RLHF), is nothing new. Like a grandmaster of deep inner strength, it is what gave GPT, Qwen, DeepSeek, and other models their "soul", aligning their answers with human reasoning and values. This RL-driven revolution has fundamentally changed how we interact with AI.

Yet just as everyone assumed that reinforcement learning's stage was confined to text, the same wave has been sweeping, at breakneck speed, into a far broader field: computer vision (CV).

Preface: when RLHF "sweeps into" ...
Can a single training example substantially boost a large model's mathematical reasoning?
机器之心· 2025-05-09 09:02
The paper finds that RLVR training with just one training example (dubbed 1-shot RLVR) raises Qwen2.5-Math-1.5B from 36.0% to 73.6% on MATH500, and Qwen2.5-Math-7B from 51.0% to 79.2%. This roughly matches RLVR trained on a 1.2k-example dataset that contains the same single example. RLVR with two training examples even slightly exceeds the 1.2k dataset (called DSR-sub) and is on par with RLVR on the full 7.5k MATH training set. The same pattern is observed across six commonly used mathematical reasoning benchmarks.

The paper's first author, Yiping Wang, is a PhD student at the University of Washington; his advisor and corresponding author, Simon Shaolei Du, is an Assistant Professor at the University of Washington. The other two corresponding authors, Yelong Shen and Shuohang Wang, are Principal Researchers at Microsoft GenAI.

Recently, large language models (LLMs) have made remarkable progress in reasoning, especially on complex mathematical tasks. One of the key methods driving this progress is Reinforcement Learning with Verifiable Rewards (RLVR) ...
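To make the "verifiable reward" idea concrete, below is a minimal sketch of the kind of rule-based reward typically used for math RLVR: the model's final answer is extracted and compared against the ground truth, with no learned reward model involved. The helper names and the \boxed{} answer convention are illustrative assumptions, not a description of this paper's exact checker.

```python
import re

def extract_boxed_answer(completion: str):
    """Pull the last \\boxed{...} expression from a model completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary, rule-based reward: 1.0 if the extracted final answer matches
    the reference exactly, else 0.0. No learned reward model is involved."""
    pred = extract_boxed_answer(completion)
    return 1.0 if pred is not None and pred == gold_answer.strip() else 0.0

# Toy usage
print(verifiable_reward("The result is \\boxed{42}.", "42"))  # 1.0
print(verifiable_reward("I think it's \\boxed{41}.", "42"))   # 0.0
```

In a 1-shot setting like the one summarized above, such a reward would simply be evaluated on many sampled completions of the single training prompt during the RL updates.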