Direct Preference Optimization (DPO)
Practice and Reflections on High-Intensity RL over the Past Month - How to Gain Points?
自动驾驶之心· 2025-10-19 23:32
Core Insights
- The article discusses recent advancements and challenges in Reinforcement Learning (RL) for Vision-Language Models (VLM), emphasizing the importance of foundational work and iterative improvement in achieving performance gains [2][4].

RL Goals
- The primary objectives for RL on a VLM are a 1-2 point gain in overall performance over the SFT model, and gains exceeding 1-2 points on specific benchmarks such as mathematics and instruction following [5].

RL Overall Approach
- The essence of RL is to improve sampling efficiency rather than to teach the base model new knowledge; given unlimited attempts, the base model can match or exceed the RL model in the probability of producing a correct response [7][8].

Challenges in VLM RL
- Key challenges include selecting an efficient RL algorithm, meeting demanding infrastructure requirements, and RL's sensitivity to data quality and organization [10][12].

Data Organization
- Effective data organization is crucial, requiring a balanced mix of tasks and high-quality input data. Output length is also strongly tied to the RL algorithm used, so the characteristics of the training data must be considered carefully [13][14].

Key Findings and Conclusions
- Short responses harm training effectiveness, and it is essential to construct response pairs with a clear distinction between chosen (acceptable) and rejected outputs; a sketch of such pair construction follows below. The importance of meticulous data checking and the absence of a "silver bullet" solution are emphasized [19][24].
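As a concrete illustration of the pair-construction advice above, here is a minimal sketch assuming responses have already been sampled and scored by some reward or rubric; the function name, thresholds, and field names are illustrative assumptions, not taken from the article.

```python
# Hypothetical sketch of preference-pair construction: sample several responses
# per prompt, drop overly short ones (reported to hurt training), and keep only
# pairs whose chosen/rejected scores are clearly separated.

def build_preference_pairs(samples, min_len=64, min_margin=0.3):
    """samples: list of dicts with 'prompt', 'response', and 'score' keys."""
    by_prompt = {}
    for s in samples:
        by_prompt.setdefault(s["prompt"], []).append(s)

    pairs = []
    for prompt, group in by_prompt.items():
        # Filter out short responses, which were reported to harm training.
        group = [g for g in group if len(g["response"]) >= min_len]
        if len(group) < 2:
            continue
        group.sort(key=lambda g: g["score"], reverse=True)
        chosen, rejected = group[0], group[-1]
        # Require a clear score gap so the pair carries a real preference signal.
        if chosen["score"] - rejected["score"] >= min_margin:
            pairs.append({"prompt": prompt,
                          "chosen": chosen["response"],
                          "rejected": rejected["response"]})
    return pairs
```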
Li Auto PhysGM: Feed-Forward 4D Content Generation from a Single Image in 30 Seconds
理想TOP2· 2025-09-02 06:35
Core Viewpoint
- The article discusses the PhysGM framework, which recasts 4D generation from an optimization problem into an inference problem, allowing rapid and efficient generation of 4D simulations from a single image [1][2].

Group 1: Advantages of PhysGM
- PhysGM significantly improves speed, generating results in under 30 seconds compared to previous methods that could take hours [3][9].
- The framework simplifies the process by eliminating the need for pre-processing and iterative per-scene optimization [3][9].
- It enhances physical realism and visual quality in the generated simulations [3][9].
- PhysGM does not rely on large language models, making it more accessible and scalable [3][9].

Group 2: Potential Limitations
- There may be limits to generalization, particularly for non-rigid objects, and the current model predicts only a single aggregate vector of physical properties [4].
- Performance is constrained by the underlying 3D reconstruction models, which can lose geometric detail or introduce texture inconsistencies [4][6].

Group 3: Training Strategy
- Training consists of two phases: supervised pre-training to establish physical priors, followed by DPO-based fine-tuning to align the model with real-world simulations (a generic DPO loss sketch follows after this summary) [7][8].
- The first phase builds a dataset of over 24,000 3D assets and uses a dual-head U-Net architecture to predict geometric and physical parameters [7].
- The second phase uses Direct Preference Optimization (DPO) to refine the model based on how well generated simulations match real reference videos [8].

Group 4: Comparison with Other Methods
- PhysGM outperforms several existing methods across multiple dimensions, including the need for pre-processing, automatic parameter computation, generalizability, reliance on large language models, and inference time [9].
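The DPO fine-tuning phase is only described at a high level; below is a minimal, generic sketch of the standard DPO objective in PyTorch style. This is not PhysGM's actual training code, and the β value and tensor names are illustrative; in the article's setting the "chosen"/"rejected" samples would correspond to generated simulations ranked by similarity to the reference video.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Generic DPO objective: push the policy's log-ratio for the preferred
    sample above that of the dispreferred sample, relative to a frozen
    reference model. All inputs are per-example sequence log-probabilities."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)) averaged over the batch
    return -F.logsigmoid(logits).mean()
```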
VLA+RL or Pure RL? The Development Roadmap of Reinforcement Learning Seen Through 200+ Papers
具身智能之心· 2025-08-18 00:07
Core Insights
- The article provides a comprehensive analysis of the intersection of reinforcement learning (RL) and visual intelligence, focusing on the evolution of strategies and key research themes in visual reinforcement learning [5][17][25].

Group 1: Key Themes in Visual Reinforcement Learning
- Over 200 representative studies are categorized into four main pillars: multimodal large language models, visual generation, unified model frameworks, and vision-language-action models [5][17].
- Each pillar is examined for algorithm design, reward engineering, and benchmark progress, highlighting trends and open challenges in the field [5][17][25].

Group 2: Reinforcement Learning Techniques
- Various RL techniques are discussed, including Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which are used to improve training stability and efficiency (a minimal GRPO advantage sketch follows after this summary) [15][16].
- The article emphasizes the role of reward models, such as those based on human feedback and verifiable rewards, in guiding the training of visual RL agents [10][12][21].

Group 3: Applications in Visual and Video Reasoning
- Applications of RL to visual reasoning tasks are outlined, including 2D and 3D perception, image reasoning, and video reasoning, showing how these methods improve task performance [18][19][20].
- Specific studies are highlighted that use RL to strengthen capabilities on complex visual tasks such as object detection and spatial reasoning [18][19][20].

Group 4: Evaluation Metrics and Benchmarks
- The article discusses the need for new evaluation metrics tailored to large-model visual RL, combining traditional metrics with preference-based assessments [31][35].
- It surveys benchmarks that support training and evaluation in the visual domain, emphasizing the role of human preference data in shaping reward models [40][41].

Group 5: Future Directions and Challenges
- Key challenges are identified, such as balancing depth and efficiency in reasoning, and future research directions are suggested to address them [43][44].
- The article highlights the importance of adaptive strategies and hierarchical RL approaches for improving vision-language-action agents [43][44].
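GRPO is named but not detailed here, so the following is a minimal sketch of its core idea under common conventions: advantages are computed per prompt by standardizing rewards within the group of responses sampled for that prompt, with no learned value critic. The function name and example rewards are illustrative; clipping and KL terms from the full algorithm are omitted.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each sampled response's advantage is its
    reward standardized against the other responses drawn for the same prompt,
    replacing the value critic used by PPO with a simple group baseline."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. six responses sampled for one prompt, scored by a reward model
print(grpo_advantages([0.2, 0.9, 0.4, 0.4, 0.1, 0.7]))
```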
Latest Survey on Visual Reinforcement Learning: A Full-Field Review (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of reinforcement learning with computer vision, marking a paradigm shift in how AI interacts with visual data [3][4].
- It highlights the potential for AI to not only understand but also create and optimize visual content based on human preferences, transforming AI from passive observer to active decision-maker [4].

Research Background and Overview
- The emergence of Visual Reinforcement Learning (VRL) is driven by the successful application of reinforcement learning in Large Language Models (LLMs) [7].
- Three core challenges are identified: stable policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward-function design for long-horizon decision-making [7][8].

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework formalizes the problem as a Markov Decision Process (MDP), unifying the RL formulations for text and visual generation [15].
- Three main alignment paradigms are covered: RL with human feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR); a minimal verifiable-reward sketch follows after this summary [16][18].

Core Applications of Visual Reinforcement Learning
- VRL research is categorized into four main areas: Multimodal Large Language Models (MLLM), Visual Generation, Unified Models, and Vision-Language-Action (VLA) Models [31].
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32].

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing specific benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48].
- The article emphasizes the need for metrics that align with human perception and can validate the performance of VRL systems [61].

Future Directions and Challenges
- Four key challenges are outlined for the future of VRL: balancing depth and efficiency in reasoning, handling long-horizon RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization [50][52][54].
- Future research should focus on integrating model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to broaden VRL's practical applications [57].
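As a small illustration of the RLVR paradigm mentioned above, here is a hedged sketch of a verifiable reward: a programmatic check against ground truth rather than a learned preference model. The `\boxed{...}` answer format and the parsing rule are assumptions made for the example, not details from the survey.

```python
import re

def verifiable_math_reward(response: str, ground_truth: str) -> float:
    """RLVR-style reward sketch: score 1.0 if the final answer extracted from
    the response matches the ground truth exactly, else 0.0. Assumes the
    final answer is wrapped in \\boxed{...}."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# A response ending with "... so the answer is \boxed{42}" scores 1.0
print(verifiable_math_reward(r"so the answer is \boxed{42}", "42"))
```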
Institute of Automation, Chinese Academy of Sciences (CASIA): Vision-Tactile-Language-Action Model Design and Dataset Creation
具身智能之心· 2025-07-30 00:02
Core Viewpoint
- The article describes the development of a Vision-Tactile-Language-Action (VTLA) model aimed at improving robot manipulation in contact-intensive scenarios by integrating visual and tactile inputs with language instructions [2].

Group 1: Model Development
- The VTLA framework addresses the gap in applying vision-language models (VLM) to language-conditioned robotic manipulation, especially beyond visually dominated tasks [2].
- A low-cost multimodal dataset was created in simulation, designed specifically for fingertip insertion tasks and consisting of visual-tactile-action-instruction pairs (an illustrative sample layout follows below) [2].

Group 2: Performance and Results
- The VTLA model achieved over 90% success on unseen hole types, significantly outperforming traditional imitation-learning methods and existing multimodal baselines [2].
- The model was further validated in real-world peg-in-hole assembly experiments, demonstrating strong simulation-to-reality (Sim2Real) transfer [2].
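To make the dataset description concrete, below is a purely illustrative sketch of what one visual-tactile-action-instruction sample might look like. Field names, shapes, and types are assumptions for the sketch, not the published dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VTLASample:
    """One hypothetical visual-tactile-action-instruction training pair for a
    fingertip insertion task."""
    rgb: np.ndarray        # scene/wrist camera image, e.g. (224, 224, 3) uint8
    tactile: np.ndarray    # fingertip tactile sensor reading, e.g. (32, 32, 3) float32
    instruction: str       # natural-language command
    action: np.ndarray     # commanded end-effector delta pose, e.g. (6,) float32

def make_dummy_sample() -> VTLASample:
    # Placeholder data only, to show how a training example is assembled.
    return VTLASample(
        rgb=np.zeros((224, 224, 3), dtype=np.uint8),
        tactile=np.zeros((32, 32, 3), dtype=np.float32),
        instruction="insert the peg into the round hole",
        action=np.zeros(6, dtype=np.float32),
    )
```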