RLinf Releases πRL: Online Reinforcement Learning Fine-Tuning of π0 and π0.5
机器之心 · 2025-11-06 08:58
Core Insights
- The article discusses recent advances in robotics, focusing on the VLA models π0 and π0.5 developed by Physical Intelligence. Both use flow matching to generate high-dimensional, smooth, continuous action sequences and show clear advantages on complex manipulation tasks [2][3]. (A minimal sketch of flow-matching action sampling appears after this summary.)

Group 1: VLA Models and Challenges
- VLA models depend heavily on large-scale, high-quality human demonstration data, which is costly and time-consuming to collect and annotate [2].
- Reinforcement learning (RL) lets an agent explore and improve iteratively through real interaction with the environment, reducing this data dependency and lifting the performance ceiling of supervised fine-tuning (SFT) [2].

Group 2: πRL Framework
- A collaboration among institutions including Tsinghua University, Peking University, and CMU produced the πRL framework for online reinforcement-learning fine-tuning of flow matching VLA models [3].
- On the LIBERO benchmark, πRL reached average success rates of 97.6% for π0 and 98.3% for π0.5, validating the fine-tuning approach [3].

Group 3: Technical Innovations
- πRL introduces two technical routes, Flow-Noise and Flow-SDE, to address the core obstacle that the log-likelihood of a flow matching VLA's output actions cannot be computed directly [8][10].
- Flow-Noise models the denoising process as a discrete-time Markov process, enabling direct computation of the joint probability density of the denoising sequence (see the Flow-Noise sketch below) [10].
- Flow-SDE couples the denoising process with environment interaction, constructing a two-layer Markov decision process (MDP) (see the Flow-SDE sketch below) [20].

Group 4: Performance Improvements
- Across 4,352 pick-and-place task combinations, πRL improved success rates by over 40%, with final success rates exceeding 80% [3][24].
- On LIBERO, πRL raised π0's average success rate from 57.6% to 97.6% and π0.5's from 77.1% to 98.3%, surpassing flow matching VLAs trained on the full dataset [19].

Group 5: Generalization and Robustness
- πRL markedly improves both models' generalization to new environments, as shown by tests with domain randomization [26].
- πRL also reduces the average number of steps needed to complete tasks, indicating better efficiency than supervised fine-tuning alone [28].

Group 6: Future Directions
- Planned work on πRL includes more benchmark evaluations, deeper analysis of out-of-distribution (OOD) generalization, and further exploration of critic design for improved training stability [35][36].
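As background for the flow matching mechanism referenced in the Core Insights, here is a minimal, hypothetical sketch of how a flow matching policy turns Gaussian noise into an action chunk by Euler-integrating a learned velocity field. The function and argument names (`velocity_net`, `obs_emb`, and so on) are illustrative assumptions, not the π0 API.

```python
import torch

def sample_actions(velocity_net, obs_emb, horizon=16, act_dim=7, n_steps=10):
    """Generic flow-matching sampler: Euler-integrate the learned velocity
    field from pure noise (t=0) to a denoised action chunk (t=1)."""
    a = torch.randn(horizon, act_dim)      # a_0 ~ N(0, I): pure noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((1,), k * dt)
        v = velocity_net(a, t, obs_emb)    # v_theta(a_t, t, obs): predicted velocity
        a = a + dt * v                     # deterministic Euler ODE step
    return a                               # smooth, high-dimensional action chunk
```

Because this sampler is a deterministic ODE, it assigns no tractable log-likelihood to its output, which is precisely the obstacle the Flow-Noise and Flow-SDE routes address.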
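The Flow-Noise sketch below is a minimal, hypothetical illustration of the idea summarized in Group 3, assuming the route injects learnable Gaussian noise at each Euler step so the denoising chain becomes a discrete-time Markov process with Gaussian transitions whose log-probabilities sum to a joint log-likelihood. The names (`noise_std`, the Gaussian kernel, the shapes) are assumptions, not the paper's exact formulation.

```python
import torch
from torch.distributions import Normal

def flow_noise_rollout(velocity_net, noise_std, obs_emb,
                       horizon=16, act_dim=7, n_steps=10):
    """Perturb each Euler step with learnable Gaussian noise so the chain
    a_0 -> a_1 -> ... -> a_K has tractable per-step transition densities;
    the joint log-likelihood is their sum (Markov factorization)."""
    a = torch.randn(horizon, act_dim)                # start from pure noise
    dt = 1.0 / n_steps
    log_prob = torch.tensor(0.0)
    for k in range(n_steps):
        t = torch.full((1,), k * dt)
        mean = a + dt * velocity_net(a, t, obs_emb)  # deterministic Euler mean
        dist = Normal(mean, noise_std[k])            # Gaussian transition kernel
        a = dist.sample()
        log_prob = log_prob + dist.log_prob(a).sum() # accumulate joint density
    return a, log_prob   # log-likelihood usable in a PPO-style importance ratio
```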
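And here is a hypothetical sketch of the Flow-SDE two-layer MDP: the inner layer unrolls the stochastic denoising steps, the outer layer is the ordinary environment transition, and reward arrives only when the fully denoised action is executed. `policy.sde_step`, `policy.init_noise`, and the gym-style `env` interface are illustrative assumptions.

```python
def two_layer_mdp_rollout(env, policy, n_denoise=10, max_env_steps=100):
    """Collect transitions from the two-layer MDP: each inner state is
    (obs, partially denoised action a_k, denoising index k); the last inner
    step of each chunk triggers the outer env step and carries its reward."""
    obs = env.reset()
    transitions = []
    for _ in range(max_env_steps):
        a = policy.init_noise()                            # a_0 ~ N(0, I)
        for k in range(n_denoise):
            a_next, log_prob = policy.sde_step(obs, a, k)  # one SDE denoising step
            reward, done = 0.0, False
            if k == n_denoise - 1:                         # fully denoised: act in env
                obs_next, reward, done, _ = env.step(a_next)
            transitions.append((obs, a, k, a_next, log_prob, reward))
            a = a_next
        obs = obs_next
        if done:
            break
    return transitions
```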