SFT or RL: How Should VLA Models Actually Be Trained?
具身智能之心·2025-10-28 00:02

Core Insights
- The articles cover recent advances in reinforcement learning (RL) applied to Vision-Language-Action (VLA) models, highlighting significant gains in generalization and training efficiency.

Group 1: Research Findings
- The first study investigates how RL improves the generalization of VLA models, targeting the error accumulation and distribution shift that arise under supervised fine-tuning (SFT). The authors establish a new benchmark spanning visual, semantic, and execution dimensions and show that RL fine-tuning with Proximal Policy Optimization (PPO) substantially improves semantic understanding and execution robustness while matching SFT on visual generalization [2]. A minimal PPO update sketch follows this summary.
- The second study introduces RLinf-VLA, a framework for large-scale RL training of VLA models. It addresses the systems challenges of combining RL and VLA training, achieving up to 2.27x speedup over baseline methods, supporting multiple VLA architectures and RL algorithms, and reaching a 98.11% success rate across 130 LIBERO tasks [3].

Group 2: Practical Applications
- RLinf-VLA distills best practices for applying RL to VLA training and exposes a unified interface across multiple VLA architectures and simulators, lowering the barrier to large-scale RL for VLA applications; a hypothetical registry-style sketch of such an interface appears after the PPO example [3].
- The research underscores the importance of RL in improving VLA performance and points toward training methodologies that lean more heavily on RL's strengths [15].
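For readers unfamiliar with the mechanics behind the PPO fine-tuning mentioned above, the following is a minimal sketch of a single clipped-surrogate PPO update in PyTorch. The ToyVLAPolicy network, the observation/action dimensions, and the fake rollout batch are assumptions for illustration only; they do not reproduce either paper's setup or the RLinf-VLA implementation.

```python
# Minimal sketch of a PPO update for a policy that emits discrete action tokens.
# ToyVLAPolicy is an assumed stand-in for a pretrained VLA backbone.
import torch
import torch.nn as nn

OBS_DIM, NUM_ACTION_TOKENS = 32, 16


class ToyVLAPolicy(nn.Module):
    """Stand-in policy: observation -> action-token logits and a value estimate."""

    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh())
        self.action_head = nn.Linear(64, NUM_ACTION_TOKENS)
        self.value_head = nn.Linear(64, 1)

    def forward(self, obs):
        h = self.trunk(obs)
        return self.action_head(h), self.value_head(h).squeeze(-1)


def ppo_update(policy, optimizer, obs, actions, old_log_probs, advantages,
               returns, clip_eps=0.2, value_coef=0.5):
    """One clipped-surrogate PPO step on a batch of rollout data."""
    logits, values = policy(obs)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)

    # Probability ratio between the current policy and the rollout-time policy.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    value_loss = (returns - values).pow(2).mean()
    loss = policy_loss + value_coef * value_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    policy = ToyVLAPolicy()
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

    # Fake rollout batch standing in for simulator trajectories.
    obs = torch.randn(8, OBS_DIM)
    actions = torch.randint(0, NUM_ACTION_TOKENS, (8,))
    with torch.no_grad():
        logits, _ = policy(obs)
        old_log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    advantages = torch.randn(8)
    returns = torch.randn(8)

    print(ppo_update(policy, optimizer, obs, actions, old_log_probs, advantages, returns))
```

In an actual VLA setting, the action head would sit on top of a pretrained vision-language backbone and the rollout batch would come from a simulator benchmark such as LIBERO rather than random tensors.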
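As a rough illustration of what a unified interface over multiple VLA architectures and simulators can look like, here is a hypothetical registry-and-config sketch. All class, registry, and field names below are invented for this example and do not reflect the actual RLinf-VLA API.

```python
# Hypothetical registry-style unified interface: policies and simulators are
# registered under string keys and resolved from a single training config.
from dataclasses import dataclass
from typing import Callable, Dict

POLICY_REGISTRY: Dict[str, Callable[[], object]] = {}
SIMULATOR_REGISTRY: Dict[str, Callable[[], object]] = {}


def register(registry: Dict[str, Callable[[], object]], name: str):
    """Decorator that records a constructor under a string key."""
    def wrap(ctor):
        registry[name] = ctor
        return ctor
    return wrap


@register(POLICY_REGISTRY, "toy-vla")
class ToyVLA:
    def act(self, obs):
        return 0  # placeholder action token


@register(SIMULATOR_REGISTRY, "toy-sim")
class ToySim:
    def reset(self):
        return {"image": None, "instruction": "pick up the cube"}

    def step(self, action):
        return self.reset(), 0.0, True  # next_obs, reward, done


@dataclass
class TrainConfig:
    policy: str = "toy-vla"
    simulator: str = "toy-sim"
    algorithm: str = "ppo"


def build(cfg: TrainConfig):
    """Resolve string keys from the config into concrete policy/simulator objects."""
    return POLICY_REGISTRY[cfg.policy](), SIMULATOR_REGISTRY[cfg.simulator]()


if __name__ == "__main__":
    policy, sim = build(TrainConfig())
    obs = sim.reset()
    print(policy.act(obs))
```

The design point such a framework makes is that swapping the VLA architecture, simulator, or RL algorithm becomes a config change rather than a code rewrite, which is what lowers the barrier for large-scale RL training of VLA models.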