Reinforcement Learning with Verifiable Rewards (RLVR)
VLA+RL or pure reinforcement learning? Tracing the development of reinforcement learning through 200+ papers
具身智能之心· 2025-08-18 00:07
Core Insights
- The article provides a comprehensive analysis of the intersection of reinforcement learning (RL) and visual intelligence, focusing on the evolution of strategies and key research themes in visual reinforcement learning [5][17][25].

Group 1: Key Themes in Visual Reinforcement Learning
- The article categorizes over 200 representative studies into four main pillars: multimodal large language models, visual generation, unified model frameworks, and vision-language-action models [5][17].
- Each pillar is examined for algorithm design, reward engineering, and benchmark progress, highlighting trends and open challenges in the field [5][17][25].

Group 2: Reinforcement Learning Techniques
- Various reinforcement learning techniques are discussed, including Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which are used to improve training stability and efficiency (see the GRPO sketch after this list) [15][16].
- The article emphasizes the importance of reward models, such as those based on human feedback and verifiable rewards, in guiding the training of visual reinforcement learning agents [10][12][21].

Group 3: Applications in Visual and Video Reasoning
- The article outlines applications of reinforcement learning in visual reasoning tasks, including 2D and 3D perception, image reasoning, and video reasoning, showing how these methods improve task performance [18][19][20].
- Specific studies are highlighted that use reinforcement learning to strengthen complex visual capabilities such as object detection and spatial reasoning [18][19][20].

Group 4: Evaluation Metrics and Benchmarks
- The article discusses the need for new evaluation metrics tailored to visual reinforcement learning with large models, combining traditional metrics with preference-based assessments [31][35].
- It provides an overview of benchmarks that support training and evaluation in the visual domain, emphasizing the role of human preference data in shaping reward models [40][41].

Group 5: Future Directions and Challenges
- The article identifies key challenges in visual reinforcement learning, such as balancing depth and efficiency in reasoning, and suggests future research directions to address them [43][44].
- It highlights the importance of developing adaptive strategies and hierarchical reinforcement learning approaches to improve the performance of vision-language-action agents [43][44].
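To make the PPO/GRPO contrast in Group 2 concrete, here is a minimal sketch of GRPO's group-relative advantage computation, assuming a setup where several completions are sampled per prompt and each receives a scalar reward; the function and variable names are illustrative and not taken from any of the surveyed papers.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: each completion's reward is normalized against
    the other completions sampled for the same prompt, so no learned value
    network (critic) is needed, unlike PPO.

    rewards: tensor of shape (num_prompts, group_size), one scalar reward
    per sampled completion.
    """
    mean = rewards.mean(dim=-1, keepdim=True)   # per-prompt baseline
    std = rewards.std(dim=-1, keepdim=True)     # per-prompt scale
    return (rewards - mean) / (std + eps)       # group-normalized advantage

# Example: 2 prompts, 4 sampled completions each, binary verifiable rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```

The group-normalized advantage is then typically plugged into a PPO-style clipped objective; dropping the critic is what gives GRPO its efficiency appeal in these pipelines.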
Latest survey on visual reinforcement learning: a full-field overview (National University of Singapore, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of reinforcement learning with computer vision, marking a paradigm shift in how AI interacts with visual data [3][4].
- It highlights the potential for AI to not only understand but also create and optimize visual content based on human preferences, transforming AI from passive observer to active decision-maker [4].

Research Background and Overview
- The emergence of visual reinforcement learning (VRL) is driven by the successful application of reinforcement learning in large language models (LLMs) [7].
- The article identifies three core challenges in the field: stable policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward-function design for long-horizon decision-making [7][8].

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework formalizes VRL as a Markov decision process (MDP), unifying the RL formulations for text and visual generation [15].
- Three main alignment paradigms are distinguished: reinforcement learning from human feedback (RLHF), Direct Preference Optimization (DPO), and reinforcement learning with verifiable rewards (RLVR); a minimal RLVR reward sketch follows this list [16][18].

Core Applications of Visual Reinforcement Learning
- The article categorizes VRL research into four main areas: multimodal large language models (MLLM), visual generation, unified models, and vision-language-action (VLA) models [31].
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32].

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing specific benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48].
- The article emphasizes the need for metrics that align with human perception and can validate the performance of VRL systems [61].

Future Directions and Challenges
- The article outlines four key challenges for the future of VRL: balancing depth and efficiency in reasoning, handling long-horizon RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization [50][52][54].
- It suggests that future research should focus on integrating model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to enhance the practical applications of VRL [57].
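Since RLVR is the unifying theme of this section, a minimal sketch of a rule-based verifiable reward for math-style answers may help; the final-answer convention (a `\boxed{}` expression) and the function names are assumptions for illustration, not the survey's implementation.

```python
import re

def extract_boxed_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a model completion.
    The tagging convention is an assumption; datasets and prompts differ."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted answer matches the
    reference exactly after whitespace normalization, else 0.0.
    Real pipelines usually add symbolic or numeric equivalence checks."""
    predicted = extract_boxed_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Example usage
print(verifiable_reward("... so the result is \\boxed{42}", "42"))  # 1.0
print(verifiable_reward("... the answer is 41", "42"))              # 0.0
```

Unlike RLHF's learned reward model or DPO's implicit preference reward, a programmatic check like this is hard to exploit through reward hacking, which is a large part of RLVR's appeal for math and code tasks.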
Can a single training example substantially boost a large model's mathematical reasoning?
机器之心· 2025-05-09 09:02
Core Insights
- The article discusses significant advances in the reasoning capabilities of large language models (LLMs), particularly on complex mathematical tasks, driven by Reinforcement Learning with Verifiable Rewards (RLVR) [1][2].

Group 1: Research Findings
- Researchers from the University of Washington and Microsoft found that training on a single example (1-shot RLVR) can significantly improve model performance across a range of mathematical reasoning tasks [2][3].
- With 1-shot RLVR, Qwen2.5-Math-1.5B improved from 36.0% to 73.6% and Qwen2.5-Math-7B from 51.0% to 79.2% on MATH500, matching results obtained with a 1.2k-example dataset [3][13].
- The 1-shot RLVR approach was also effective on non-mathematical reasoning tasks such as ARC-Easy and ARC-Challenge [5].

Group 2: Methodology and Data Selection
- Training combined a policy gradient loss, a KL divergence loss, and an entropy loss, with the policy gradient loss identified as the primary driver of improvement (see the sketch after this list) [7][19].
- The researchers ranked candidate training examples using a historical variance score, while noting that this selection criterion is not necessarily optimal [8][19].
- The findings indicate that 1-shot RLVR generalizes across mathematical topics: a single training example from one topic can improve performance on others [13][16].

Group 3: Observations and Implications
- A saturation-then-generalization phenomenon was observed: training accuracy quickly approached 100%, yet downstream task performance continued to improve [10][11].
- The study highlighted the importance of encouraging exploration through the entropy loss, which contributed to the gains from 1-shot RLVR [20].
- The results support prior conclusions that the base models used for RLVR often already possess latent reasoning capabilities that can be activated with minimal data [22].
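To illustrate how the three loss terms described in Group 2 might be combined, here is a minimal sketch of such an objective; the coefficients, tensor shapes, and the simple KL estimate are placeholders chosen for clarity, not the values or exact formulation used in the paper.

```python
import torch

def one_shot_rlvr_loss(logprobs: torch.Tensor,
                       ref_logprobs: torch.Tensor,
                       advantages: torch.Tensor,
                       entropy: torch.Tensor,
                       kl_coef: float = 0.01,
                       ent_coef: float = 0.001) -> torch.Tensor:
    """Illustrative combined objective: a REINFORCE-style policy gradient
    term (reported as the main driver of improvement), a KL penalty toward
    the reference policy, and an entropy bonus that encourages exploration.

    logprobs, ref_logprobs: per-token log-probs of sampled completions under
    the current and reference policies, shape (batch, seq_len).
    advantages: per-sequence scalar advantages, shape (batch,).
    entropy: per-token policy entropy, shape (batch, seq_len).
    """
    pg_loss = -(advantages.unsqueeze(-1) * logprobs).mean()   # policy gradient term
    kl_loss = (logprobs - ref_logprobs).mean()                # simple KL estimate
    ent_loss = -entropy.mean()                                # reward higher entropy
    return pg_loss + kl_coef * kl_loss + ent_coef * ent_loss
```

Consistent with the summary above, the policy gradient term would carry most of the improvement, while the entropy bonus helps sustain exploration even after training accuracy on the single example saturates.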