π0.5开源前，国内也开源了一个强大的端到端统一基础模型！具备强泛化和长程操作

Core Viewpoint - The article discusses the release of π0.5 and WALL-OSS, highlighting their advancements in embodied intelligence and the significance of these models in the robotics industry, particularly in enhancing task execution in complex environments [1][3][5]. Group 1: Model Capabilities - π0.5 demonstrates enhanced generalization capabilities through heterogeneous task collaborative training, enabling robots to perform long-term, fine-grained operations in new household environments [3][5]. - WALL-OSS achieves embodied perception through large-scale multimodal pre-training, allowing seamless integration of instruction reasoning, sub-goal decomposition, and fine-grained action synthesis within a single differentiable framework [8][18]. - The model exhibits high success rates in complex long-term manipulation tasks, showcasing robust instruction-following abilities and understanding of complex scenarios, surpassing existing baseline models [8][18][28]. Group 2: Training and Data - The training process for WALL-OSS involves discrete, continuous, and joint phases, requiring only RTX 4090-level computational power for training and inference deployment [14][15]. - A multi-source dataset centered on embodied tasks was constructed, addressing the lack of large-scale, aligned VLA supervision and current visual language models' spatial understanding gaps [20][22]. - The dataset includes thousands of hours of data, focusing on both short-range operation tasks and long-range reasoning tasks, ensuring comprehensive training for the model [20][22][24]. Group 3: Experimental Analysis - Experimental analysis on embodied visual question answering and six robotic operation tasks focused on language instruction understanding, reasoning, and generalization, as well as planning and execution of long-term, multi-stage tasks [25][31]. - WALL-OSS significantly outperformed its original baseline model in object grounding, scene captioning, and action planning tasks, demonstrating its enhanced scene understanding capabilities [27][28]. - The model's ability to follow novel instructions without task-specific fine-tuning was validated, achieving 85% average task progress on known object instructions and 61% on novel object instructions [29][31]. Group 4: Industry Impact - The advancements in WALL-OSS and π0.5 are positioned to address existing limitations in visual language models and embodied understanding, paving the way for more capable and versatile robotic systems [5][8][20]. - The company, established in December 2023, focuses on developing a general embodied intelligence model using real-world data, aiming to create robots with fine operational capabilities [39]. - The recent completion of a nearly 1 billion yuan A+ round of financing indicates strong investor confidence in the company's direction and potential impact on the industry [39].