Core Viewpoint
- The article reviews recent advances in Vision-Language-Action (VLA) models, focusing on how Reinforcement Learning (RL) techniques are being integrated to improve task performance and training stability [1].

Group 1: Early Exploration with iRe-VLA
- iRe-VLA builds its core algorithm on PPO and introduces a two-stage training paradigm to counter the instability of online reinforcement learning (a sketch of the two-stage loop follows this summary) [2].
- The implementation uses BLIP-2 3B as the VLM backbone, replacing the final fully connected layer with an action head consisting of a token learner and an MLP [2].
- Experiments are run in simulation environments such as Metaworld and Franka Kitchen, with tasks split into three categories for evaluation [2].

Group 2: Preference Alignment with GRAPE
- GRAPE introduces preference alignment into VLA training, with a formulation designed specifically around VLA characteristics [6].
- The reward for each trajectory combines three parts: a success reward, a self-reward, and an external reward derived from a custom cost function (see the reward-composition sketch below) [8].
- The external reward is computed by decomposing trajectories into stages and evaluating them with a VLM task decomposer [9].

Group 3: LOOP and RIPT-VLA
- LOOP combines RLOO and PPO to handle sparse rewards and long action sequences in multi-task scenarios (see the leave-one-out advantage sketch below) [11].
- RIPT-VLA adopts the LOOP algorithm for online RL and releases open-source code for the implementation [13].
- The approach adds several tricks to improve training efficiency, such as dynamic rejection mechanisms and multi-task sampling [15].

Group 4: System and Algorithm Innovations in RL4VLA
- RL4VLA models action generation as a multi-modal dialogue and trains with PPO, using dense pseudo-rewards to guide optimization (see the reward-shaping sketch below) [18].
- Training incorporates a Robotic Process Reward Model that predicts the likelihood of action sequences, densifying the reward signal [20].
- Adaptive curriculum selection strategies are emphasized for improving sample efficiency and generalization (see the curriculum-sampling sketch below) [21][23].

Group 5: Engineering Challenges and Future Directions
- The article calls for new RL algorithms tailored to VLA-RL, particularly ones that address sparse rewards and improve sample efficiency [30].
- It points to engineering challenges in raising sampling throughput and managing memory costs in VLA-scale training [30].
- Effective reward design and applying RL to non-autoregressive VLA architectures are identified as critical directions for future research [30].
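The two-stage paradigm summarized in Group 1 can be illustrated with a minimal, self-contained toy loop: stage 1 adapts only the action head with the backbone frozen, stage 2 fine-tunes the full model on collected successful rollouts. Everything below (the toy modules, the dummy losses, the pretend "successes") is an illustrative assumption, not iRe-VLA's actual code.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    def __init__(self, obs_dim=16, hidden=32, act_dim=4):
        super().__init__()
        self.backbone = nn.Linear(obs_dim, hidden)        # stands in for the VLM backbone
        self.action_head = nn.Sequential(                 # stand-in for token learner + MLP
            nn.ReLU(), nn.Linear(hidden, act_dim))

    def forward(self, obs):
        return self.action_head(self.backbone(obs))

policy = ToyVLAPolicy()
successes = []  # successful trajectories collected during the RL stage

for iteration in range(3):
    # Stage 1: online RL on the action head only; the backbone stays frozen
    # to keep optimization stable (a dummy loss replaces the PPO objective here).
    for p in policy.backbone.parameters():
        p.requires_grad_(False)
    rl_opt = torch.optim.Adam(policy.action_head.parameters(), lr=1e-3)
    for _ in range(10):
        obs = torch.randn(8, 16)
        act = policy(obs)
        rl_loss = act.pow(2).mean()              # placeholder for the PPO surrogate
        rl_opt.zero_grad(); rl_loss.backward(); rl_opt.step()
        successes.append((obs, act.detach()))    # pretend these rollouts succeeded

    # Stage 2: supervised fine-tuning of the full model on collected successes,
    # consolidating behaviors discovered during the RL stage.
    for p in policy.parameters():
        p.requires_grad_(True)
    sft_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
    for obs, target in successes[-10:]:
        sft_loss = (policy(obs) - target).pow(2).mean()
        sft_opt.zero_grad(); sft_loss.backward(); sft_opt.step()
```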
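The three-part trajectory score described for GRAPE in Group 2 can be sketched as a weighted sum. Only the decomposition into success / self / external terms comes from the summary; the weights, the exponential aggregation of stage costs, and the function names are illustrative assumptions.

```python
import math

def trajectory_reward(success: bool,
                      self_logprob: float,
                      stage_costs: list[float],
                      w_success: float = 1.0,
                      w_self: float = 0.1,
                      w_ext: float = 0.5) -> float:
    # Success reward: binary task outcome.
    r_success = 1.0 if success else 0.0
    # Self-reward: the policy's own confidence in the trajectory, here taken as
    # the log-probability it assigns to its actions.
    r_self = self_logprob
    # External reward: stages (e.g. produced by a VLM task decomposer) are scored
    # by a custom cost function; lower total cost maps to higher reward.
    r_external = math.exp(-sum(stage_costs))
    return w_success * r_success + w_self * r_self + w_ext * r_external

# Example: a successful trajectory decomposed into three stages.
print(trajectory_reward(True, self_logprob=-0.8, stage_costs=[0.2, 0.1, 0.4]))
```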
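The RLOO-plus-PPO combination in Group 3 can be sketched as a leave-one-out advantage estimate fed into a PPO-style clipped surrogate. Tensor shapes, the clipping constant, and the example numbers are illustrative assumptions.

```python
import torch

def rloo_advantages(returns: torch.Tensor) -> torch.Tensor:
    """returns: (K,) total rewards of K rollouts sampled for the same task.
    Each rollout is baselined by the mean of the other K-1 rollouts,
    so no learned value function (critic) is needed."""
    K = returns.numel()
    baseline = (returns.sum() - returns) / (K - 1)
    return returns - baseline

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Example: 4 rollouts of the same task with sparse (0/1) returns.
returns = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = rloo_advantages(returns)
logp_old = torch.tensor([-1.2, -0.9, -1.5, -1.1])
logp_new = logp_old + 0.05   # stand-in for the updated policy's log-probs
print(ppo_clip_loss(logp_new, logp_old, adv))
```

Note that when all K rollouts of a task share the same outcome, the leave-one-out advantages collapse to zero; such uninformative groups are what dynamic rejection mechanisms are typically meant to filter out before the update.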
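The dense pseudo-rewards mentioned in Group 4 can be sketched as per-step process-reward scores layered on top of a sparse terminal task reward. The blending coefficient and the toy process-reward values are assumptions; the real scores would come from a trained Robotic Process Reward Model.

```python
import torch

def shaped_rewards(sparse_task_reward: float,
                   prm_scores: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """prm_scores: (T,) per-step likelihoods from a process reward model that
    scores partial action sequences. The sparse task reward is delivered only
    at the final step; the PRM scores fill in the intermediate steps."""
    T = prm_scores.numel()
    rewards = beta * prm_scores.clone()
    rewards[T - 1] += sparse_task_reward
    return rewards

# Example: a 5-step episode that succeeds (terminal reward 1.0).
prm = torch.tensor([0.6, 0.7, 0.5, 0.8, 0.9])   # toy per-step PRM likelihoods
print(shaped_rewards(1.0, prm))
```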
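One plausible reading of adaptive curriculum selection (Group 4) is to sample tasks more often when the policy's current success rate sits near 50%, where rollouts carry the most learning signal. The scoring rule below is an assumption for illustration, not the article's specific strategy.

```python
import random

def curriculum_weights(success_rates: dict[str, float]) -> dict[str, float]:
    # Tasks at roughly 50% success get the highest weight; already-solved or
    # currently impossible tasks receive little probability mass.
    scores = {t: max(1e-3, s * (1.0 - s)) for t, s in success_rates.items()}
    total = sum(scores.values())
    return {t: v / total for t, v in scores.items()}

def sample_task(success_rates: dict[str, float]) -> str:
    weights = curriculum_weights(success_rates)
    tasks, probs = zip(*weights.items())
    return random.choices(tasks, weights=probs, k=1)[0]

# Example: three tasks with different current success rates.
rates = {"pick_cube": 0.95, "open_drawer": 0.50, "stack_blocks": 0.05}
print(curriculum_weights(rates))
print(sample_task(rates))
```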
A round-up of the latest progress in RL for VLA~
自动驾驶之心·2025-07-03 12:41