Flow-Noise and Flow-SDE dual algorithms
Latest from Tsinghua University! πRL: a general approach that uses online reinforcement learning to let robots "learn by doing"
具身智能之心· 2025-11-03 00:03
Core Insights
- The article describes a breakthrough in adapting reinforcement learning (RL) to flow-based Vision-Language-Action (VLA) models, overcoming the limitations of both traditional supervised fine-tuning (SFT) and existing RL approaches [1][3][30]

Group 1: Challenges in Current VLA Model Training
- Current VLA model training faces a dilemma: SFT relies on large sets of expert trajectories, which are costly to collect and generalize poorly, while existing RL methods do not fit the core characteristics of flow-based models [3][4]
- The fundamental barrier to RL adaptation for flow-based VLA models is the difficulty of calculating the action log-likelihood during the denoising process [4][5]

Group 2: Innovative Solutions Proposed
- A new framework combining "Flow-Noise and Flow-SDE dual algorithms + parallel simulation training" is proposed to address the RL adaptation challenges for flow-based VLA models [1][5]
- The Flow-Noise algorithm introduces a learnable noise network to optimize the denoising process, while Flow-SDE converts deterministic ODE denoising into a stochastic SDE to balance exploration and efficiency [7][9]

Group 3: Performance Improvements
- The proposed methods show significant performance improvements on multi-task benchmarks, achieving near-perfect scores and breaking through the SFT bottleneck [15][16]
- On the LIBERO benchmark, the Flow-Noise and Flow-SDE models achieved average scores of 97.6% and 96.1% respectively, significantly outperforming traditional SFT methods [16][18]

Group 4: Large-Scale Adaptation and Training
- The framework supports large-scale multi-task optimization, demonstrated by its ability to handle 4,352 task combinations in the ManiSkill benchmark while maintaining its performance advantage [20][22]
- Training with 320 parallel environments significantly reduces data transmission delays and improves optimization efficiency [17][22]

Group 5: Future Directions
- Future research will focus on optimizing noise injection strategies, improving out-of-distribution (OOD) generalization, and validating the framework's adaptability in real-world robotic applications [29][30]
- Integrating multi-modal observations, such as tactile and force feedback, is also suggested as a way to improve robustness in complex scenarios [29][30]
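
The log-likelihood barrier described in Group 1, and the ODE-to-SDE idea in Group 2, can be illustrated with a toy sketch. This is not the article's implementation. The `velocity` function is a hypothetical stand-in for a learned velocity field, `sigma` is an assumed fixed noise scale, and the integrator is a plain Euler-Maruyama scheme. A faithful ODE-to-SDE conversion would typically also adjust the drift so that the marginal distributions match, which this sketch does not do. It only shows that a deterministic denoising step has no transition density to evaluate, whereas a noisy step is a Gaussian transition whose log-probability can be computed in closed form.

```python
import math
import random


def velocity(x, t):
    """Hypothetical stand-in for a learned velocity field: pulls x toward a fixed target."""
    target = [0.5, -0.2]
    return [g - xi for xi, g in zip(x, target)]


def ode_denoise(x, steps):
    """Deterministic Euler integration: the same start noise always yields the same action."""
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x


def gaussian_logpdf(x, mean, std):
    """Log-density of an isotropic Gaussian N(mean, std^2 I) evaluated at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * sq / std**2 - d * math.log(std) - 0.5 * d * math.log(2 * math.pi)


def sde_denoise(x, steps, sigma, rng):
    """Euler-Maruyama integration: each step is a Gaussian transition with a known log-prob."""
    dt = 1.0 / steps
    total_logp = 0.0
    for k in range(steps):
        v = velocity(x, k * dt)
        mean = [xi + vi * dt for xi, vi in zip(x, v)]
        std = sigma * math.sqrt(dt)  # assumed constant noise scale per step
        nxt = [rng.gauss(m, std) for m in mean]
        total_logp += gaussian_logpdf(nxt, mean, std)
        x = nxt
    return x, total_logp


if __name__ == "__main__":
    start = [1.0, 1.0]
    print("ODE action:", ode_denoise(start, 10))
    action, logp = sde_denoise(start, 10, 0.3, random.Random(0))
    print("SDE action:", action, "sum of step log-probs:", logp)
```

The summed per-step log-probability returned by `sde_denoise` is the kind of quantity a policy-gradient method needs in order to compare an old and an updated policy. How the article's algorithms actually parameterize and train the noise is not specified in this summary.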