What exactly is Xiaomi's MiMo-Embodied about? Unifying autonomous driving and embodied tasks, with SOTA on 29 benchmarks!
具身智能之心·2025-11-22 16:03

Core Insights - The article discusses Xiaomi's MiMo-Embodied, the first cross-domain foundation model that integrates autonomous driving and embodied intelligence, achieving state-of-the-art (SOTA) performance across 29 benchmarks [5][7].

Summary by Sections

Existing Model Limitations - Current models are confined to a single domain; there is no unified vision-language model (VLM) that bridges outdoor autonomous driving and indoor embodied intelligence, so cross-scenario generalization remains weak [5].

MiMo-Embodied's Solutions - MiMo-Embodied is the first open-source cross-domain unified model. It integrates autonomous driving and embodied intelligence tasks in a single framework, enabling positive transfer and mutual enhancement between the two domains [7].

Comprehensive Capabilities - The model covers three core autonomous driving capabilities (environment perception, state prediction, and driving planning) and three core embodied intelligence capabilities (affordance prediction, task planning, and spatial understanding) [8].

Training and Data Construction - MiMo-Embodied uses a carefully curated dataset and a four-stage training strategy to overcome cross-domain task interference, yielding consistent performance gains [9].

Model Architecture - The architecture consists of a Vision Transformer (ViT) for visual encoding, a multi-layer perceptron (MLP) projector that maps visual features into the language space, and a large language model (LLM) for text understanding and logical reasoning [12][13]. A minimal sketch of this pipeline appears at the end of this summary.

Training Strategy - The four-stage training strategy comprises:
1. General and embodied knowledge learning
2. Autonomous driving knowledge learning
3. Chain-of-Thought (CoT) reasoning fine-tuning
4. Reinforcement learning (RL) fine-tuning [20][21]
A schematic training schedule is also sketched below the summary.

Performance Metrics - MiMo-Embodied delivers superior results in affordance prediction, task planning, spatial understanding, environment perception, state prediction, and driving planning across the evaluated benchmarks [24][25].

Ablation Studies - Training on a single domain causes a marked loss in cross-domain generalization, whereas the four-stage training strategy improves embodied and autonomous driving performance by 4% and 8.1%, respectively [27].

Real-World Task Evaluation - The model has been evaluated on real-world tasks, demonstrating its capabilities in embodied navigation and manipulation [29].
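As referenced in the Model Architecture section above, the following is a minimal, hypothetical sketch of a ViT → MLP projector → LLM pipeline of the kind the article describes. The class name, layer counts, and dimensions are assumptions for illustration only, not Xiaomi's released implementation.

```python
# Illustrative ViT -> MLP projector -> LLM layout (assumed, not MiMo-Embodied's actual code).
import torch
import torch.nn as nn

class VisionLanguageSketch(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Stand-in for the Vision Transformer (ViT) visual encoder.
        self.vit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # MLP projector: maps visual features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Stand-in for the LLM backbone that consumes [visual tokens; text tokens].
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=16, batch_first=True),
            num_layers=2,
        )

    def forward(self, image_patches, text_embeddings):
        visual_feats = self.vit(image_patches)        # (B, N_vis, vision_dim)
        visual_tokens = self.projector(visual_feats)  # (B, N_vis, llm_dim)
        fused = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.llm(fused)                        # joint multimodal features

# Tiny smoke test with random tensors.
model = VisionLanguageSketch()
img = torch.randn(2, 256, 1024)   # pretend patch embeddings from an image
txt = torch.randn(2, 32, 4096)    # pretend text token embeddings
print(model(img, txt).shape)      # torch.Size([2, 288, 4096])
```

In this kind of layout the projector is typically the only new component bridging the two pretrained backbones, which lets the vision encoder and LLM remain largely intact during cross-domain training.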
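Likewise, the four-stage training strategy can be pictured as a simple curriculum that is run in order, each stage starting from the previous checkpoint. The stage names follow the summary; the data mixtures, the Stage dataclass, and the run_curriculum helper are hypothetical placeholders, not the paper's actual training code.

```python
# Hypothetical sketch of the four-stage curriculum described in the article.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data_mixture: list   # which corpora are mixed in this stage (assumed names)
    objective: str       # supervised fine-tuning ("sft") vs. reinforcement learning ("rl")

CURRICULUM = [
    Stage("general_and_embodied_knowledge",
          ["general_vqa", "embodied_affordance", "spatial_understanding"], "sft"),
    Stage("autonomous_driving_knowledge",
          ["driving_perception", "state_prediction", "driving_planning"], "sft"),
    Stage("cot_reasoning_finetune",
          ["chain_of_thought_annotations"], "sft"),
    Stage("rl_finetune",
          ["reward_labeled_rollouts"], "rl"),
]

def run_curriculum(model, curriculum=CURRICULUM):
    """Run each stage in order; later stages build on the previous checkpoint."""
    for stage in curriculum:
        print(f"Stage: {stage.name} | data: {stage.data_mixture} | objective: {stage.objective}")
        # train_one_stage(model, stage)  # placeholder for the actual trainer
    return model

run_curriculum(model=None)
```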