Core Viewpoint
- The article examines the challenges and advances in embodied intelligence, focusing on LingBot-VLA, a new model that surpasses the previous benchmark Pi0.5 in generalization capability and training efficiency [2][26].

Group 1: Current Challenges in Embodied Intelligence
- Embodied intelligence today lacks generalization: robots can perform specific tasks but struggle to transfer to broader applications [2].
- The industry consensus is that larger and more diverse real-world robot data is needed to improve model training and task understanding [2][10].
- High-quality data collection is costly, and the difficulty of reusing data across different robot configurations limits the effectiveness of many models [2][8].

Group 2: Introduction of LingBot-VLA
- LingBot-VLA is trained on approximately 20,000 hours of real-world data spanning nine robot configurations, enabling it to perform more than 100 tasks effectively [2][10].
- Compared with Pi0.5, the model improves the average success rate by 4.28% and the partial success rate by 7.76% [14][17].
- Its architecture pairs a pre-trained vision-language model with a specialized action generator, enhancing its ability to understand and execute complex tasks [20][21].

Group 3: Performance and Testing
- LingBot-VLA was evaluated on the GM-100 benchmark, which includes 100 diverse real-world tasks, and demonstrated superior performance across different robot platforms [12][13].
- Its success rates were the highest among the tested models, indicating robustness on complex and varied tasks [14][17].
- Testing involved 25 robots from three platforms, underscoring the model's cross-platform and cross-task capabilities [13].
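The two-stage architecture described above (a pre-trained vision-language backbone feeding a specialized action generator) can be sketched in miniature. Everything below is an illustrative assumption: the class names (`VlmBackbone`, `ActionExpert`), dimensions, and the chunked-action interface are hypothetical stand-ins, not LingBot-VLA's actual design or API.

```python
import numpy as np

rng = np.random.default_rng(0)

class VlmBackbone:
    """Stand-in for a pre-trained vision-language model: fuses an image
    and an instruction into a single embedding. Dimensions are arbitrary."""
    def __init__(self, dim=64):
        self.w_img = rng.standard_normal((3 * 32 * 32, dim)) * 0.01
        self.w_txt = rng.standard_normal((128, dim)) * 0.01

    def encode(self, image, text_ids):
        img_feat = image.reshape(-1) @ self.w_img   # flatten mock frame
        txt = np.zeros(128)
        txt[text_ids] = 1.0                         # bag-of-tokens placeholder
        return np.tanh(img_feat + txt @ self.w_txt)

class ActionExpert:
    """Stand-in for the specialized action generator: maps the fused
    embedding plus proprioceptive state to a chunk of future actions."""
    def __init__(self, dim=64, state_dim=14, action_dim=14, chunk=8):
        self.chunk, self.action_dim = chunk, action_dim
        self.w = rng.standard_normal((dim + state_dim, chunk * action_dim)) * 0.01

    def act(self, embedding, state):
        x = np.concatenate([embedding, state])
        return (x @ self.w).reshape(self.chunk, self.action_dim)

backbone = VlmBackbone()
expert = ActionExpert()

image = rng.standard_normal((3, 32, 32))  # mock camera frame
state = np.zeros(14)                       # mock 14-DoF joint state
actions = expert.act(backbone.encode(image, text_ids=[3, 17]), state)
print(actions.shape)  # (8, 14): an 8-step chunk of 14-DoF actions
```

The design choice this sketch illustrates is the separation of concerns commonly used in VLA models: the backbone handles perception and language grounding, while a smaller action head specializes in producing executable motor commands.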
Group 4: Efficiency and Scalability
- LingBot-VLA achieves higher data utilization and computational efficiency, outperforming Pi0.5 while training on fewer samples [17][19].
- The training framework scales well, maintaining high training throughput even as GPU resources are increased [19][24].
- Systematic optimization of the training codebase raised training speed to 261 samples per second per GPU [24].

Group 5: Implications for the Future
- The advances embodied in LingBot-VLA not only set a new industry standard but also provide empirical evidence that scaling real-world data yields stronger generalization [26][28].
- The open-source release of LingBot-VLA, along with its associated tools, fosters a collaborative environment for further development in embodied intelligence [28][29].
- The model's development is seen as a strategic move by Ant Group toward integrating embodied intelligence into broader artificial general intelligence (AGI) frameworks [28][29].
Ant Group enters the VLA arena with an open-source foundation model that surpasses Pi0.5
机器之心 · 2026-01-28 03:36