End-to-End Whole-Body VLA Model Lumo-1: Uniting Mind and Hand in Robots and Entering the Era of the Reasoning-Action Closed Loop

Core Insights
- The article discusses advances in robotics, focusing on the Lumo-1 model developed by Stardust Intelligence, which aims to strengthen robots' reasoning and action capabilities so that they can perform complex tasks without explicit programming [7][9][11].

Group 1: Lumo-1 Model Overview
- Lumo-1 is an end-to-end VLA model designed to integrate reasoning and action in robotics, enabling robots to understand task intentions and execute them seamlessly [7][9].
- The model outperforms previous models such as π0 and π0.5 on multi-step tasks, fine manipulation, and generalizable actions, especially in out-of-distribution scenarios [9][11].

Group 2: Training Phases
- The training of Lumo-1 consists of three phases:
  1. Embodied VLM pre-training on curated vision-language data to develop spatial understanding and trajectory inference [15].
  2. Cross-embodiment joint training to enhance instruction following and spatial reasoning capabilities [16].
  3. Real-world reasoning-action training on the Astribot S1 robot to learn executable action patterns [16][18].

Group 3: Reasoning and Action Alignment
- Lumo-1 incorporates structured reasoning, allowing the robot to break tasks down into sub-tasks and understand the relationships between actions and instructions [22][30].
- The model uses reinforcement learning for reasoning-action alignment, calibrating the discrepancies between high-level reasoning and low-level actions, which significantly improves task success rates and generalization [27][28].

Group 4: Performance Metrics
- Lumo-1 outperforms mainstream models on six of seven multimodal benchmarks, demonstrating robust multimodal perception and reasoning abilities without compromising its core capabilities [29].
- The model adapts to varied environments and tasks, for example adjusting arm positions for containers of different heights and recognizing handwritten menus, which illustrates its generalization capability [29].