A review of the latest progress in RL for VLA
自动驾驶之心· 2025-07-03 12:41
Core Viewpoint
- The article surveys recent advances in Vision-Language-Action (VLA) models, focusing on how reinforcement learning (RL) techniques are being integrated to improve performance and training stability across tasks [1].

Group 1: Early Exploration with iRe-VLA
- The core algorithm of iRe-VLA is PPO; it introduces a two-stage training paradigm to address the instability of online reinforcement learning [2].
- The implementation uses BLIP-2 (3B) as the VLM backbone, replacing the final fully connected layer with an action head consisting of a token learner and an MLP [2].
- Experiments are run in simulation environments such as Meta-World and Franka Kitchen, with tasks divided into three categories for evaluation [2].

Group 2: Preference Alignment with GRAPE
- GRAPE introduces preference alignment into VLA training, tailored to the characteristics of VLA models [6].
- The reward for each trajectory combines three parts: a success reward, a self-reward, and an external reward derived from a custom cost function [8].
- The external reward is computed by decomposing trajectories into stages and evaluating them with a VLM-based task decomposer [9].

Group 3: LOOP and RIPT-VLA
- LOOP combines RLOO and PPO to handle sparse rewards and long horizons in multi-task settings [11]; a minimal sketch of the leave-one-out advantage follows this summary.
- RIPT-VLA uses the LOOP algorithm for online RL and provides open-source code for the implementation [13].
- The approach includes several tricks to improve training efficiency, such as a dynamic rejection mechanism and multi-task sampling [15].

Group 4: System and Algorithm Innovations in RL4VLA
- RL4VLA models action generation as a multi-modal dialogue and trains with PPO, using dense pseudo-rewards to guide the training process [18].
- Training employs a Robotic Process Reward Model that predicts the likelihood of action sequences, strengthening the reward signal [20].
- The article emphasizes adaptive curriculum selection strategies to improve sample efficiency and generalization [21][23].

Group 5: Engineering Challenges and Future Directions
- The article highlights the need for new RL algorithms suited to VLA-RL, especially for handling sparse rewards and improving sample efficiency [30].
- It points out engineering challenges in raising sampling throughput and managing memory costs in VLA settings [30].
- Effective reward design and applying RL to non-autoregressive VLA architectures are identified as key directions for future research [30].
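To make the LOOP idea in Group 3 concrete, here is a minimal sketch of how an RLOO-style leave-one-out baseline can be plugged into the standard PPO clipped surrogate when each task prompt yields K sampled trajectories with sparse scalar rewards. The function names, tensor shapes, and toy values are illustrative assumptions, not code from RIPT-VLA.

```python
import torch


def leave_one_out_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """RLOO-style advantage: each rollout's reward minus the mean reward
    of the other K-1 rollouts sampled for the same task prompt.
    rewards: shape (K,) of scalar trajectory rewards."""
    K = rewards.numel()
    assert K > 1, "need at least two rollouts per task for a leave-one-out baseline"
    total = rewards.sum()
    baseline = (total - rewards) / (K - 1)  # mean reward of the other rollouts
    return rewards - baseline


def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate, here applied with the leave-one-out
    advantages above (one trajectory-level advantage per rollout)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


# Toy usage: 4 rollouts of the same task with sparse 0/1 success rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = leave_one_out_advantages(rewards)  # tensor([ 0.6667, -0.6667, -0.6667,  0.6667])
logp_old = torch.randn(4)
logp_new = logp_old + 0.05 * torch.randn(4)
loss = ppo_clip_loss(logp_new, logp_old, adv)
```

The leave-one-out baseline removes the need for a learned value function under sparse rewards, which is why combining it with PPO-style clipping is attractive in multi-task VLA settings.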
In large-model RL, is DPO still no match for PPO?
自动驾驶之心· 2025-06-22 14:09
Core Insights
- The article examines the theoretical and experimental shortcomings of DPO (Direct Preference Optimization) relative to PPO (Proximal Policy Optimization), noting that while DPO appears to lead on open-source benchmarks, top closed-source models such as GPT-4 and Claude rely on PPO [1][2]. The DPO objective discussed here is sketched after this summary.

DPO's Deficiencies
- DPO suffers from problems akin to reward hacking: even without an explicit reward model, it can produce solutions that do not align with human preferences [2].
- The theoretical analysis shows that, given true reward signals, the set of policies reachable by PPO is a proper subset of those reachable by DPO, meaning DPO can yield solutions that drift away from the reference policy [3].

Experimental Findings
- Experiments show that DPO can assign higher probability to data points not covered by the preference dataset, leading to unexpected behavior, whereas PPO optimizes effectively under its KL constraint [6].
- DPO's performance can be improved by reducing distribution shift through methods such as SafeSFT, but it still does not surpass PPO [8].

Performance Metrics
- Benchmark results consistently show PPO outperforming both iterative DPO and DPO across tasks, particularly in programming competitions [10].
- Models trained with PPO reach substantially higher pass rates than DPO models, with PPO reaching up to 44.4% on pass@5 while DPO models struggle to produce meaningful results [11][12].

Conclusion
- The findings suggest that although DPO has theoretical appeal, its practical value on high-stakes tasks such as competitive programming is limited compared to PPO, which continues to set new performance standards [13].
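For reference, the DPO objective contrasted with PPO above reduces to a single log-sigmoid loss over chosen/rejected log-probability ratios against a frozen reference model. The sketch below assumes sequence-level log-probabilities (summed over tokens) have already been computed under the policy and the reference model; the names and the beta value are illustrative.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -log sigmoid(beta * ((log pi_c - log ref_c) - (log pi_r - log ref_r))).
    Inputs are sequence-level log-probabilities, shape (batch,)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Note: because the implicit reward beta * log(pi / pi_ref) is anchored only by
# pairwise comparisons, probability mass can drift onto responses never seen in
# the preference data -- the out-of-distribution failure mode the article
# contrasts with PPO's explicit KL-constrained optimization.
```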
Model Maxxing: RFT, DPO, SFT with OpenAI — Ilan Bigio, OpenAI
AI Engineer· 2025-06-17 03:49
Full workshop covering all forms of fine-tuning and prompting: SFT, DPO, RFT, prompt engineering / optimization, and agent scaffolding.

About Ilan Bigio
Ilan Bigio is a founding member of OpenAI’s Developer Experience team, where he explores model capabilities, builds demos and developer tools, and shares his learnings through talks and docs. His work includes creating the AI phone ordering demo showcased at DevDay 2024 and leading technical development for Swarm, the precursor to the Agents SDK, ...
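As a minimal illustration of the SFT workflow the workshop covers, the snippet below uploads a JSONL training file and launches a supervised fine-tuning job with the OpenAI Python SDK. The file name and base model are placeholder assumptions; DPO and RFT jobs are configured through the same job-creation endpoint, so check the current API reference for the exact options.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload chat-formatted training examples (one JSON object per line, each with
# a "messages" list of user/assistant turns). "train.jsonl" is a placeholder.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch a supervised fine-tuning (SFT) job; the base model name is an
# example -- pick one from the current fine-tuning docs.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```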