How Can RL Improve the Generalization of Embodied VLA Large Models? A Tsinghua University Team's NeurIPS 2025 Paper Analyzes the Generalization Gap Between RL and SFT
机器之心·2025-10-12 02:41

Core Insights

- The article discusses the potential of Vision-Language-Action (VLA) large models for embodied intelligence, highlighting the limited ability of current supervised fine-tuning (SFT) methods to generalize to new environments and tasks, and emphasizing the advantages of Reinforcement Learning (RL) in strengthening the generalization of VLA models [2][4].

Group 1: Research Findings

- A new evaluation benchmark was created to address the limited generalization of VLA models, comparing how RL and SFT improve model robustness across visual, semantic, and execution challenges [4].
- Experiments showed that RL algorithms such as Proximal Policy Optimization (PPO) significantly improved robustness in semantic understanding and task execution, while remaining comparable to SFT in visually varied scenarios [4][11].

Group 2: RL Methodology

- The research team tested three RL algorithms: PPO, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). PPO outperformed DPO and GRPO on multi-step decision tasks, which the team attributes to the partially observable Markov decision process (POMDP) nature of robotic tasks [9][11].
- To make PPO training on VLA models efficient, three key innovations were introduced: a shared Actor-Critic architecture that reduces memory usage by 45% and increases training speed by 35%, a warm-up stage on 140 high-quality trajectories that improves convergence speed by 50%, and limiting PPO training to a single epoch per batch of rollouts, which substantially shortens training time [13][15]. (Illustrative sketches of these three ideas are given after the main text below.)

Group 3: Comparison of SFT and RL

- The study probed the data-scaling limits of SFT and found that performance saturates at roughly 16,000 demonstration trajectories. In contrast, RL delivered a 42.6% performance improvement on out-of-distribution tasks, indicating stronger generalization [18][19].
- A comprehensive evaluation benchmark was built to dissect the generalization differences between SFT and RL along visual, semantic, and execution dimensions, with RL showing clear advantages in semantic understanding and execution robustness [21][23].

Group 4: Practical Implications

- The study underscores the core value of RL for building truly generalizable embodied agents, which matters increasingly as robotic applications grow more complex and variable. The team has open-sourced RLinf, a large-scale RL framework for embodied intelligence, to support further research [25].
- Case-by-case visual analysis revealed deeper differences: RL maintained task stability under noise and handled unseen objects effectively, whereas SFT tended to get stuck repeating the same actions [26].
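To make the shared Actor-Critic design mentioned under Group 2 concrete, here is a minimal PyTorch sketch (not the team's implementation): the policy head and the value head reuse a single VLA backbone, so only one large encoder is kept in memory and one forward pass serves both heads, which is the usual source of memory savings in this kind of design. Class names, argument names, and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SharedActorCriticVLA(nn.Module):
    """Hypothetical shared-backbone actor-critic for a VLA policy."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, action_dim: int):
        super().__init__()
        self.backbone = backbone                               # shared VLA encoder
        self.action_head = nn.Linear(hidden_dim, action_dim)   # actor: action logits
        self.value_head = nn.Linear(hidden_dim, 1)              # critic: scalar state value

    def forward(self, obs_tokens: torch.Tensor):
        # Assumed backbone output shape: (batch, tokens, hidden_dim)
        feats = self.backbone(obs_tokens)
        pooled = feats.mean(dim=1)                              # simple mean pooling over tokens
        action_logits = self.action_head(pooled)
        value = self.value_head(pooled).squeeze(-1)
        return action_logits, value
```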
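The warm-up on a small set of high-quality demonstration trajectories can likewise be sketched as behavior cloning run before PPO starts. The data loader, discrete action-token labels, and epoch count below are hypothetical; the point is only that the policy head is pre-trained on expert actions so PPO begins from a reasonable initialization.

```python
import torch
import torch.nn.functional as F

def warm_start(model, optimizer, demo_loader, epochs: int = 3):
    """Behavior-cloning warm-up on demonstration trajectories (illustrative)."""
    for _ in range(epochs):
        for obs, expert_actions in demo_loader:       # batches drawn from demo trajectories
            logits, _ = model(obs)                    # reuse shared model; ignore value head
            loss = F.cross_entropy(logits, expert_actions)  # assumes discrete action tokens
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```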
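Finally, the "single PPO epoch per rollout batch" setting amounts to running the clipped-PPO update exactly once over each batch of collected experience rather than iterating several times. The sketch below is a generic clipped-PPO step with `ppo_epochs=1`, not the authors' code; the `rollout` dictionary fields (old log-probabilities, advantages, returns) are assumptions about what a typical rollout buffer would provide.

```python
import torch

def ppo_update(model, optimizer, rollout, clip_eps=0.2, vf_coef=0.5, ppo_epochs=1):
    """One generic clipped-PPO update over a rollout batch (illustrative)."""
    loss = torch.tensor(0.0)
    for _ in range(ppo_epochs):                       # set to 1 to cut training time
        logits, values = model(rollout["obs"])
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(rollout["actions"])

        ratio = torch.exp(log_probs - rollout["old_log_probs"])
        adv = rollout["advantages"]
        policy_loss = -torch.min(
            ratio * adv,
            torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
        ).mean()
        value_loss = (values - rollout["returns"]).pow(2).mean()

        loss = policy_loss + vf_coef * value_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```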