Vision-Language-Action (VLA) large models
NeurIPS 2025 | Tsinghua Team Analyzes How RL Improves VLA Generalization
具身智能之心· 2025-10-15 04:00
Core Insights
- The article examines the potential of Vision-Language-Action (VLA) models for embodied intelligence and highlights the limitations of current supervised fine-tuning (SFT) methods in reaching human-like generalization. It emphasizes the advantages of Reinforcement Learning (RL) for improving the generalization of VLA models [1][3].

Group 1: Research Findings
- To address the limited generalization of VLA models, a new evaluation benchmark was created that compares how well RL and SFT improve model robustness across visual, semantic, and execution challenges [3][19].
- Experiments showed that RL algorithms such as Proximal Policy Optimization (PPO) significantly improved robustness in semantic understanding and task execution, while performing comparably to SFT in visually varied scenarios [3][12].

Group 2: Methodology
- The research used the open-source OpenVLA model, fine-tuned from Llama2-7b, which takes RGB images as input and emits action tokens for robotic control [6].
- Three RL methods were tested: PPO, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), with PPO showing clear advantages in multi-step decision tasks (a minimal single-epoch PPO update sketch appears after this summary) [8][15].

Group 3: PPO Training Innovations
- The research team proposed three key innovations for efficient PPO training:
  1. A shared Actor-Critic architecture that reduced memory usage by 45% and improved training speed by 35% (see the shared-backbone sketch below) [12][14].
  2. A warm-up strategy using 140 high-quality trajectories that improved convergence speed by 50% [14].
  3. Reducing PPO training to a single epoch per batch, which preserved performance without adding training time [14].

Group 4: Comparison of SFT and RL
- The study found that SFT performance plateaued at around 16,000 demonstration trajectories, while RL achieved a 42.6% performance improvement on out-of-distribution tasks, indicating stronger generalization [17][18].
- A comprehensive evaluation benchmark was developed to dissect the differences in generalization between SFT and RL along visual, semantic, and execution dimensions (see the benchmark sketch at the end) [19][21].

Group 5: Practical Implications
- The research underscores the core value of RL for building truly generalizable embodied agents, which matters increasingly as robotic applications become more complex and varied [25].
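The shared Actor-Critic design mentioned in Group 3 can be pictured as a single VLA trunk feeding two small heads, so only one copy of the large backbone is kept in memory. The sketch below is a minimal, hypothetical PyTorch rendering of that idea; `SharedActorCritic`, the stand-in `backbone`, and all dimensions are illustrative assumptions, not the paper's actual module names.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Policy and value function sharing one VLA backbone.

    Hypothetical sketch: `backbone` stands in for the OpenVLA transformer;
    the paper's exact architecture and naming are not reproduced here.
    """

    def __init__(self, backbone: nn.Module, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                              # shared trunk
        self.actor_head = nn.Linear(hidden_dim, vocab_size)   # logits over action tokens
        self.critic_head = nn.Linear(hidden_dim, 1)           # scalar state value

    def forward(self, obs_tokens: torch.Tensor):
        # One forward pass through the shared trunk serves both heads.
        h = self.backbone(obs_tokens)               # (batch, seq, hidden_dim)
        last = h[:, -1]                             # summary of the current state
        action_logits = self.actor_head(h)          # per-position action-token logits
        value = self.critic_head(last).squeeze(-1)  # V(s) estimate
        return action_logits, value

# Toy usage with an embedding layer standing in for the real backbone.
toy = SharedActorCritic(nn.Embedding(32000, 64), hidden_dim=64, vocab_size=256)
logits, value = toy(torch.randint(0, 32000, (2, 10)))
```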
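The summary also reports that a single PPO epoch per batch of rollouts was sufficient. The helper below sketches one clipped-PPO gradient step under that setting, assuming a model with the two-head interface from the previous sketch; function and argument names are illustrative, and the 140-trajectory warm-up would amount to ordinary supervised fine-tuning on those demonstrations before this loop starts.

```python
import torch

def ppo_update(model, optimizer, obs, actions, old_logprobs, advantages, returns,
               clip_eps=0.2, value_coef=0.5):
    """One clipped-PPO update; called once per rollout batch (single epoch)."""
    logits, values = model(obs)
    dist = torch.distributions.Categorical(logits=logits[:, -1])
    new_logprobs = dist.log_prob(actions)

    # Clipped surrogate objective on the probability ratio.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value regression toward the empirical returns.
    value_loss = torch.nn.functional.mse_loss(values, returns)

    loss = policy_loss + value_coef * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```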
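Finally, a benchmark that separates visual, semantic, and execution generalization can be organized as a simple task-suite configuration so that success rates are reported per axis. The perturbation names below are purely illustrative placeholders; the paper's actual task list is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class GeneralizationSuite:
    """Hypothetical enumeration of the three generalization axes the benchmark probes."""
    visual: list = field(default_factory=lambda: [
        "unseen_table_texture", "novel_distractor_objects", "lighting_shift"])
    semantic: list = field(default_factory=lambda: [
        "paraphrased_instruction", "unseen_object_noun", "novel_verb"])
    execution: list = field(default_factory=lambda: [
        "shifted_object_pose", "changed_robot_init_state"])

    def tasks(self):
        # Yield (axis, perturbation) pairs so an evaluator can report
        # success rates per axis, mirroring the SFT-vs-RL comparison.
        for axis in ("visual", "semantic", "execution"):
            for name in getattr(self, axis):
                yield axis, name

suite = GeneralizationSuite()
for axis, name in suite.tasks():
    print(axis, name)  # an evaluator would roll out the policy on each variant
```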