Vision-Language-Action (VLA) Models
Surpassing π0 and π0.5 Across the Board: Lumo-1, an End-to-End Whole-Body VLA Model
自动驾驶之心 · 2025-12-12 03:02
Core Insights
- The article discusses advances in robotics, focusing on the Lumo-1 model developed by Stardust Intelligence, which aims to enhance robots' reasoning and action capabilities so they can perform complex tasks without explicit programming [9][11][12].

Group 1: Lumo-1 Model Overview
- Lumo-1 is an end-to-end VLA model designed to let robots understand and execute tasks through reasoning, rather than merely mimicking demonstrated actions [9].
- The model demonstrates superior operational intelligence and generalization, outperforming earlier models such as π0 and π0.5 on multi-step tasks and when handling unseen objects and instructions [11][13].

Group 2: Training Phases
- The training of Lumo-1 consists of three stages:
1. Embodied VLM pre-training on visual-language data to develop spatial understanding and trajectory inference [17].
2. Cross-embodiment joint training to enhance instruction following and spatial reasoning [18].
3. Real-world reasoning-action training on the Astribot S1 robot to learn executable action patterns [18][20].

Group 3: Technical Innovations
- Lumo-1 employs a Spatial Action Tokenizer (SAT) to model the action space, allowing actions to be combined and reused in a structured manner (a minimal sketch of the idea follows this summary) [21].
- The model integrates structured reasoning to form a chain of explanations for its actions, understanding the "why" behind a task before executing the "how" (see the reason-then-act sketch below) [25].

Group 4: Performance and Validation
- Lumo-1 shows significant improvements on various multimodal benchmarks, outperforming specialized models such as RoboBrain-7B and Robix-7B [31].
- The model's ability to adapt to different environments and instructions, such as adjusting arm position for containers of varying heights, demonstrates robust generalization [31].

Group 5: Implications for the Industry
- The findings suggest that data diversity in training matters more for generalization than raw data volume, indicating a shift in focus toward data quality [30].
- The advances embodied in Lumo-1 highlight the potential for robots to perform complex tasks autonomously, which could transform industries that rely on automation and robotics [9][11].
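The summary does not spell out how SAT actually discretizes the action space, so the following is only a minimal sketch of the general idea, assuming per-dimension uniform binning of normalized continuous actions into a fixed token vocabulary; `NUM_BINS`, the action range, and the encode/decode scheme are all assumptions, not Lumo-1's published design.

```python
# Hypothetical sketch of a spatial action tokenizer in the spirit of SAT:
# continuous action vectors are discretized per dimension into a fixed
# vocabulary of bins, so action chunks become token sequences that a
# language backbone can predict, combine, and reuse.
import numpy as np

NUM_BINS = 256        # assumed vocabulary size per action dimension
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def encode(actions: np.ndarray) -> np.ndarray:
    """Map continuous actions in [LOW, HIGH] to integer bin tokens."""
    clipped = np.clip(actions, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)  # rescale to [0, 1]
    return np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)

def decode(tokens: np.ndarray) -> np.ndarray:
    """Map bin tokens back to bin-center continuous actions."""
    centers = (tokens.astype(float) + 0.5) / NUM_BINS
    return centers * (HIGH - LOW) + LOW

# Example: a 2-step chunk of 7-DoF arm commands becomes 14 tokens,
# and round-tripping loses at most one bin width of precision.
chunk = np.random.uniform(LOW, HIGH, size=(2, 7))
tokens = encode(chunk)
recovered = decode(tokens)
assert np.abs(recovered - chunk).max() <= (HIGH - LOW) / NUM_BINS
```

Tokenizing actions this way is what lets a single autoregressive model emit vision, language, and action tokens from one vocabulary, which is presumably why the article emphasizes that actions become composable and reusable.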
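To make the "why before how" claim concrete, here is a purely illustrative sketch of a reason-then-act inference loop: the model first emits a structured chain of sub-tasks with rationales, then decodes action tokens per sub-task. The class and function names (`Step`, `plan`, `act`) are stand-ins of my own, not Lumo-1's API.

```python
# Illustrative reason-then-act loop: decompose the instruction into
# sub-tasks with rationales first, then decode actions per sub-task.
from dataclasses import dataclass

@dataclass
class Step:
    subtask: str    # e.g. "grasp the cup"
    rationale: str  # the "why" behind this sub-task

def plan(instruction: str) -> list[Step]:
    """Stand-in for the VLM reasoning pass: decompose the instruction."""
    return [
        Step("locate the cup", "the target object must be found first"),
        Step("grasp the cup", "a stable grasp is needed before moving"),
        Step("place the cup on the shelf", "this satisfies the instruction"),
    ]

def act(step: Step) -> list[int]:
    """Stand-in for the action head: decode action tokens per sub-task."""
    return [hash(step.subtask) % 256]  # placeholder token ids

for step in plan("put the cup on the shelf"):
    print(f"{step.subtask!r} because {step.rationale} -> {act(step)}")
```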
Lumo-1, an End-to-End Whole-Body VLA Model: Uniting Robots' Mind and Hand to Enter the Era of the Reasoning-Action Closed Loop
具身智能之心 · 2025-12-10 10:00
Core Insights
- The article discusses advances in robotics, focusing on the Lumo-1 model developed by Stardust Intelligence, which aims to enhance robots' reasoning and action capabilities so they can perform complex tasks without explicit programming [7][9][11].

Group 1: Lumo-1 Model Overview
- Lumo-1 is an end-to-end VLA model designed to integrate reasoning and action in robotics, enabling robots to understand task intentions and execute them seamlessly [7][9].
- The model outperforms earlier models such as π0 and π0.5 on multi-step tasks, fine manipulation, and generalizable actions, especially in out-of-distribution scenarios [9][11].

Group 2: Training Phases
- The training of Lumo-1 consists of three phases:
1. Embodied VLM pre-training on curated visual-language data to develop spatial understanding and trajectory inference [15].
2. Cross-embodiment joint training to enhance instruction following and spatial reasoning [16].
3. Real-world reasoning-action training on the Astribot S1 robot to learn executable action patterns [16][18].

Group 3: Reasoning and Action Alignment
- Lumo-1 incorporates structured reasoning, allowing the robot to break tasks into sub-tasks and understand the relationships between actions and instructions [22][30].
- The model employs reinforcement learning for reasoning-action alignment, calibrating discrepancies between high-level reasoning and low-level actions, which significantly improves task success rates and generalization (a toy sketch of such an alignment objective follows this summary) [27][28].

Group 4: Performance Metrics
- Lumo-1 outperforms mainstream models on six of seven multimodal benchmarks, demonstrating robust multimodal perception and reasoning without compromising its core capabilities [29].
- The model's adaptation to varied environments and tasks, such as adjusting arm position for containers of different heights and reading handwritten menus, showcases impressive generalization [29].
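The article names reinforcement learning as the mechanism for reasoning-action alignment but gives no objective, so the following is a toy REINFORCE sketch under one loud assumption: the reward is simply whether the decoded low-level action agrees with the sub-goal named in the reasoning chain. The tabular policy, sizes, and learning rate are all illustrative, not Lumo-1's actual training setup.

```python
# Toy REINFORCE loop for reasoning-action alignment: reward = 1 when the
# sampled action matches the sub-goal from the reasoning chain (assumed).
import numpy as np

rng = np.random.default_rng(0)
NUM_SUBGOALS, NUM_ACTIONS = 4, 4                 # assumed toy sizes
logits = np.zeros((NUM_SUBGOALS, NUM_ACTIONS))   # tabular "policy head"

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    g = rng.integers(NUM_SUBGOALS)         # sub-goal from the reasoning chain
    probs = softmax(logits[g])
    a = rng.choice(NUM_ACTIONS, p=probs)   # sampled low-level action
    reward = 1.0 if a == g else 0.0        # alignment reward (assumed)
    grad = -probs
    grad[a] += 1.0                         # grad of log pi(a|g) for softmax
    logits[g] += 0.1 * reward * grad       # REINFORCE ascent step

# After training, each sub-goal should map to its matching action.
print({g: int(np.argmax(logits[g])) for g in range(NUM_SUBGOALS)})
```

Even in this toy form, the sketch shows the calibration idea the article describes: the low-level action policy is nudged until it is consistent with what the high-level reasoning says should happen, rather than the two being trained in isolation.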