IROS'25 Champion Solution: X-VLA Is Now Open Source, Setting New SOTA Across Robotics Benchmarks!

Core Viewpoint
- The article covers the release of X-VLA, an open-source model for embodied intelligence that delivers significant performance gains on autonomous tasks such as folding clothes, demonstrating its robustness and generalization capabilities [2][5][7].

Group 1: Model Performance and Achievements
- X-VLA is the first open-source model to complete a 120-minute clothing folding task fully autonomously, and it achieves state-of-the-art (SOTA) performance on five authoritative simulation benchmarks with only 0.9 billion parameters [2][7].
- The X-VLA team won first place in the IROS-AGIBOT World Challenge, competing against 431 teams from 23 countries and performing strongly on real physical tasks such as grasping, folding, cooking, and pouring [4][5].

Group 2: Technical Innovations
- The model employs a Soft-Prompt mechanism to adapt to different robotic platforms, which improves training stability and efficiency on heterogeneous data [16].
- A multi-modal encoding strategy handles diverse visual inputs, optimizing resource allocation while preserving information integrity [16].
- Flow matching in the action decoder improves the smoothness and robustness of action trajectories, which is crucial for executing long-sequence tasks [17].

Group 3: Data and Training Strategies
- X-VLA uses a balanced data sampling strategy so that heterogeneous datasets are trained on equitably, preventing model bias [21].
- A rigorous data cleaning and temporal alignment pipeline improves the quality and consistency of the state-action sequences [21].
- Training includes a customized post-training workflow that adapts the model efficiently to specific tasks using smaller datasets [23][26].
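The balanced data sampling idea under Group 3 can be illustrated with a small sketch. The article does not describe X-VLA's actual sampling scheme, so everything here is an assumption made for illustration: the function names `sampling_weights` and `sample_batch`, the dataset names, and the choice of size-tempered weighting (each dataset is drawn with probability proportional to its size raised to `alpha`). With `alpha=1.0` large datasets dominate in proportion to their size; with `alpha=0.0` every dataset is drawn equally often.

```python
import random


def sampling_weights(sizes, alpha=0.5):
    """Return per-dataset sampling probabilities proportional to size ** alpha.

    Hypothetical helper, not taken from X-VLA. alpha=1.0 gives size-proportional
    sampling, alpha=0.0 gives uniform sampling over datasets.
    """
    scaled = {name: n ** alpha for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}


def sample_batch(datasets, batch_size, alpha=0.5, seed=0):
    """Draw a batch by first picking a dataset, then an example inside it.

    Assumes every dataset in `datasets` is a non-empty list.
    """
    rng = random.Random(seed)
    sizes = {name: len(items) for name, items in datasets.items()}
    weights = sampling_weights(sizes, alpha)
    names = list(weights)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append((name, rng.choice(datasets[name])))
    return batch


if __name__ == "__main__":
    # Made-up dataset names and sizes, used only to show the effect of alpha.
    sizes = {"small_robot_arm": 100, "large_mobile_manipulator": 10_000}
    for alpha in (1.0, 0.5, 0.0):
        print(alpha, sampling_weights(sizes, alpha))
```

Lowering `alpha` trades fidelity to the raw data distribution for more even coverage of small datasets, which is one common way to keep a large corpus from dominating training.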
Group 4: Experimental Results
- Across various authoritative simulation environments, X-VLA achieved SOTA performance, significantly outperforming existing models [24].
- The model also performed strongly on real-world robotic tasks, successfully completing complex long-duration tasks such as autonomous clothing folding [27].
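For readers unfamiliar with the flow-matching action decoder mentioned under Group 2, the sketch below shows the general sampling idea: start from Gaussian noise and integrate a velocity field over time with Euler steps until an action vector is reached. This is not X-VLA's decoder. The names `toy_velocity` and `sample_action` are hypothetical, and the velocity field is an analytic stand-in for a trained network: it is the conditional velocity of the straight-line path `x_t = (1 - t) * x0 + t * x1` toward a known target, chosen so the example runs without any learned weights.

```python
import random


def toy_velocity(x, t, target):
    """Analytic stand-in for a learned velocity network (an assumption).

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity
    toward x1 = target can be written as (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Integrate the velocity field from noise (t = 0) toward t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # the largest t used is (num_steps - 1) / num_steps, so 1 - t > 0
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # A made-up 3-dimensional action target, for illustration only.
    print(sample_action([0.2, -0.5, 1.0], num_steps=10))
```

In a real decoder the velocity would come from a network conditioned on observations and instructions rather than on a known target, and the number of integration steps would trade inference speed against trajectory quality.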