Core Insights
- The article presents RLinf-VLA, a unified and efficient framework for training Vision-Language-Action (VLA) models with reinforcement learning (RL), addressing the limitations of existing models that rely solely on supervised fine-tuning [2][53]
- The framework substantially improves training efficiency and generalization, achieving high success rates across simulation tasks and outperforming traditional supervised methods in real-world applications [5][53]

Framework Design
- RLinf-VLA integrates multiple simulators, RL algorithms, and VLA architectures, and optimizes resource allocation through flexible execution modes and system-level enhancements [4][53]
- It supports three GPU allocation strategies: colocated, disaggregated, and hybrid; users can switch between them via configuration files, reducing system customization costs [10][11] (a placement sketch follows this summary)

Model Compatibility
- The framework supports LoRA for parameter-efficient fine-tuning, reducing memory consumption and accelerating training while maintaining performance [12] (a LoRA sketch follows this summary)
- It is compatible with OpenVLA and its extension OpenVLA-OFT, both of which have shown strong performance on robotic manipulation benchmarks [12][22]

Multi-Simulator Support
- The framework emphasizes the importance of simulators in RL, using ManiSkill and LIBERO as its primary simulators to cover diverse tasks [13]
- It provides a unified interface over the different simulators, simplifying task implementation, and supports multiple RL algorithms, initially PPO and GRPO [13][14] (an interface sketch follows this summary)

Algorithm Design
- The framework offers flexible advantage-function and log-probability computation, allowing block-level and action-level definitions to be combined [14][15]
- It supports optimization strategies such as trajectory-length normalization and valid-action masking to improve training stability and performance [19][20] (an advantage-computation sketch follows this summary)

Experimental Results
- RLinf-VLA delivers significant performance gains, with success rates increasing by 45% to 70% on various tasks relative to baseline models [22][24]
- On the LIBERO tasks, the framework reaches an average success rate of 98.11%, demonstrating its capacity for large-scale multi-task reinforcement learning [28]

Training Efficiency
- Efficiency is evaluated in terms of throughput, with substantial improvements in training speed across different GPU configurations [30][35]
- The hybrid allocation mode outperforms the traditional modes, demonstrating the benefit of pipeline overlapping for resource utilization [35][37]

Real-World Deployment
- RLinf-VLA was successfully deployed in real-world environments, showing stronger zero-shot generalization than supervised fine-tuning strategies [51][53]
- The experiments indicate that RL-trained models adapt better to real-world tasks, achieving higher success rates in object-manipulation tasks [51]

Conclusion
- RLinf-VLA represents a significant advance in embodied intelligence, providing a robust foundation for future research and development in VLA training [53]
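The three GPU allocation modes described under Framework Design can be pictured with a minimal sketch. The component names, split ratios, and the `build_placement` helper below are assumptions for illustration, not RLinf-VLA's actual configuration schema.

```python
# Minimal sketch of the three GPU allocation modes (colocated, disaggregated,
# hybrid). Component names, config keys, and split ratios are illustrative
# assumptions, not RLinf-VLA's actual schema.
from dataclasses import dataclass

@dataclass
class Placement:
    simulator_gpus: list  # GPUs running rollout environments
    inference_gpus: list  # GPUs generating actions with the policy
    training_gpus: list   # GPUs running the RL update

def build_placement(mode: str, n_gpus: int = 8) -> Placement:
    gpus = list(range(n_gpus))
    if mode == "colocated":
        # Every component shares all GPUs and runs in alternating phases.
        return Placement(gpus, gpus, gpus)
    if mode == "disaggregated":
        # Each component owns a dedicated slice and runs concurrently.
        return Placement(gpus[:2], gpus[2:4], gpus[4:])
    if mode == "hybrid":
        # Simulation and inference share one slice while training runs on
        # the other, so rollout and update can overlap in a pipeline.
        return Placement(gpus[:4], gpus[:4], gpus[4:])
    raise ValueError(f"unknown placement mode: {mode}")

# Switching modes then amounts to changing one configuration entry, e.g.:
# placement = build_placement(cfg["cluster"]["placement_mode"])
```

Under this view, moving from colocated to hybrid execution is a configuration change rather than a code change, which is the low-customization-cost property the summary describes.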
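The LoRA support noted under Model Compatibility follows the usual parameter-efficient fine-tuning recipe. A minimal sketch using the Hugging Face peft library is shown below; the checkpoint ID, adapter rank, and target-module choices are assumptions and may differ from the paper's settings.

```python
# Sketch of attaching LoRA adapters to a VLA policy backbone with peft.
# Checkpoint ID and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

policy = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",          # publicly released OpenVLA checkpoint
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=32,                          # adapter rank (assumed value)
    lora_alpha=16,                 # scaling factor (assumed value)
    lora_dropout=0.0,
    target_modules="all-linear",   # attach adapters to every linear layer
)
policy = get_peft_model(policy, lora_cfg)
policy.print_trainable_parameters()  # only the adapters are trainable
```

Because only the adapter weights receive gradients, optimizer state and gradient memory shrink accordingly, which is the memory and speed benefit the summary attributes to LoRA.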
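The unified simulator interface mentioned under Multi-Simulator Support can be thought of as a thin, Gym-style wrapper that hides ManiSkill- and LIBERO-specific details from the RL loop. Every class, module, and method name in the sketch below is hypothetical.

```python
# Hypothetical sketch of a unified, Gym-style environment interface that the
# RL trainer programs against, independent of the underlying simulator.
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple

class UnifiedVLAEnv(ABC):
    """Common interface shared by all simulator adapters."""

    @abstractmethod
    def reset(self) -> Dict[str, Any]:
        """Return the initial observation (images plus language instruction)."""

    @abstractmethod
    def step(self, action) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        """Apply one action chunk; return (obs, reward, done, info)."""

def make_env(name: str, **kwargs) -> UnifiedVLAEnv:
    # Dispatch to a simulator-specific adapter; module names are illustrative.
    if name.startswith("maniskill"):
        from adapters.maniskill import ManiSkillAdapter  # hypothetical module
        return ManiSkillAdapter(**kwargs)
    if name.startswith("libero"):
        from adapters.libero import LiberoAdapter        # hypothetical module
        return LiberoAdapter(**kwargs)
    raise ValueError(f"unknown simulator: {name}")
```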
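The Algorithm Design points on advantage computation, valid-action masking, and trajectory-length normalization can be made concrete with a short sketch. It implements the generic GRPO group-relative advantage and a masked, length-normalized clipped policy loss; the exact block-level versus action-level definitions used by RLinf-VLA may differ, so treat this as an assumption-laden illustration.

```python
# Sketch of a GRPO-style group-relative advantage with valid-action masking
# and trajectory-length normalization. Shapes: logprobs, old_logprobs, mask
# are (group, T); `mask` is 1 for valid action tokens and 0 for padding.
# This follows the generic GRPO recipe, not necessarily RLinf-VLA's variant.
import torch

def grpo_advantages(returns: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # returns: (group,) episodic return of each rollout in the group.
    # Advantage = return normalized by the group's mean and std.
    return (returns - returns.mean()) / (returns.std() + eps)

def masked_policy_loss(logprobs: torch.Tensor,
                       old_logprobs: torch.Tensor,
                       advantages: torch.Tensor,
                       mask: torch.Tensor,
                       clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)                    # broadcast over time steps
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_step = -torch.min(unclipped, clipped) * mask  # drop padded actions
    # Trajectory-length normalization: divide each trajectory's summed loss
    # by its own number of valid actions before averaging over the group.
    per_traj = per_step.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    return per_traj.mean()
```

Masking keeps padded action tokens out of the gradient, and the per-trajectory normalization prevents long rollouts from dominating the batch loss, which is the stability benefit the summary attributes to these techniques.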
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
具身智能之心·2025-10-22 06:02