Xiaomi Bridges Autonomous Driving and Embodied Large Models, Then Open-Sources the Result
XIAOMI (HK:01810) | QbitAI · 2025-11-25 09:32

Core Insights
- The article covers the launch of MiMo-Embodied, billed as the world's first unified base model for autonomous driving and embodied operation, developed by Xiaomi's Chen Long team [1][3].

Group 1: Model Overview
- MiMo-Embodied is built on the MiMo-VL architecture and tackles knowledge transfer between autonomous driving and embodied operation by constructing a high-quality dataset spanning general vision, embodied tasks, and driving scenes [3][10].
- The model uses a progressive four-stage training strategy that incorporates Chain of Thought (CoT) and Reinforcement Learning (RL), reaching state-of-the-art (SOTA) performance across 29 benchmarks in autonomous driving and embodied intelligence [3][21].

Group 2: Challenges Addressed
- Prior models in the embodied and autonomous driving fields lacked a unified embodied Vision-Language Model (VLM), which limited their ability to interact effectively with the physical world in dynamic environments [6][9].
- The large domain gap between indoor operation and outdoor driving has hindered the transfer of capabilities between the two areas [8][10].

Group 3: Training Strategy
- The training data spans three dimensions: general multimodal understanding; embodied AI (affordance prediction, planning, and spatial understanding); and autonomous driving (perception, prediction, and planning) [15][19].
- The four-stage training strategy (sketches of the stage schedule, the CoT sample format, and the GRPO advantage computation follow after Group 5):
  1. Stage 1: Embodied AI supervised fine-tuning on general and embodied data [18].
  2. Stage 2: Autonomous driving supervised fine-tuning, focusing on multi-view spatial reasoning and complex traffic-scene analysis [20].
  3. Stage 3: CoT supervised fine-tuning, strengthening the model's handling of complex multi-step problems [20].
  4. Stage 4: RL fine-tuning with the GRPO algorithm to optimize accuracy and reliability [20].

Group 4: Performance Evaluation
- MiMo-Embodied was evaluated both qualitatively and quantitatively, posting competitive results against existing models on embodied-intelligence and autonomous-driving benchmarks [21][23].
- In embodied capabilities, MiMo-Embodied showed particular advantages in affordance prediction and spatial understanding compared with other models [23][24].
- The model also performed strongly in autonomous driving, covering perception, prediction, and planning across diverse real-world driving scenarios [25][26].

Group 5: Real-World Applications
- In embodied navigation tasks, MiMo-Embodied outperformed models such as GPT-4o and Qwen2.5-VL in object localization, with consistent performance across varied household scenarios [27].
- The model demonstrated robust affordance and spatial reasoning in manipulation tasks (an inference sketch follows below) [29].
- In autonomous driving, MiMo-Embodied handled complex maneuvers such as intersection turns and lane changes, integrating road context and vehicle state into coherent decisions [33][36].
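To make the four-stage schedule in Group 3 concrete, here is a minimal configuration sketch. The stage names follow the article; the dataset identifiers are placeholders invented for illustration, not the released data mixture.

```python
# Minimal sketch of the progressive four-stage curriculum described above.
# Stage names mirror the article; dataset names are hypothetical placeholders.
TRAINING_STAGES = [
    {"stage": 1, "name": "Embodied AI SFT",        "data": ["general_vqa", "embodied_tasks"]},
    {"stage": 2, "name": "Autonomous Driving SFT", "data": ["multi_view_driving", "traffic_scenes"]},
    {"stage": 3, "name": "CoT SFT",                "data": ["cot_annotated_samples"]},
    {"stage": 4, "name": "RL Fine-Tuning (GRPO)",  "data": ["verifiable_qa_prompts"]},
]

for cfg in TRAINING_STAGES:
    print(f"Stage {cfg['stage']}: {cfg['name']} <- {cfg['data']}")
```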
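Stage 3's CoT supervised fine-tuning trains on targets that spell out intermediate reasoning before the final answer. The sample below is a hypothetical illustration of that idea; the field names, the `<think>` delimiters, and the driving scenario are assumptions, not the released data schema.

```python
# Hypothetical sample format for CoT supervised fine-tuning: the target
# interleaves step-by-step reasoning with the final answer, so the model
# learns to externalize multi-step inference before committing to a plan.
cot_sample = {
    "images": ["front_cam.jpg", "left_cam.jpg", "right_cam.jpg"],  # multi-view input
    "prompt": "A pedestrian is near the crosswalk ahead. What should the ego vehicle do?",
    "target": (
        "<think>Step 1: The pedestrian is close to the crosswalk and facing the road. "
        "Step 2: Ego speed leaves sufficient stopping distance. "
        "Step 3: Traffic rules require yielding at crosswalks.</think> "
        "Decelerate and yield until the pedestrian has crossed."
    ),
}
print(cot_sample["target"])
```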
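Stage 4 uses GRPO (Group Relative Policy Optimization). As a rough illustration of the group-relative idea behind GRPO, and not Xiaomi's actual training code, the sketch below standardizes each sampled response's reward against its own group, which removes the need for a learned value network; the reward values are invented for the example.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: standardize each response's reward
    against the mean/std of its own sampled group (no value network)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical example: 4 candidate answers sampled for one driving-QA prompt,
# scored 1.0 if the final answer matches the reference, else 0.0.
group_rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(group_rewards))  # positive for correct answers, negative otherwise
```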
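Since the model is open-sourced, an affordance-style query like the one described in Group 5 might look roughly like the Hugging Face-style sketch below. The repository name, prompt format, and model class are all assumptions; consult the actual release for the real API.

```python
# Minimal inference sketch, assuming a Hugging Face-style checkpoint.
# "XiaomiMiMo/MiMo-Embodied-7B" is a hypothetical repo name, and the exact
# prompt/processor conventions may differ in the real release.
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model_id = "XiaomiMiMo/MiMo-Embodied-7B"  # assumption, not a confirmed identifier
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("kitchen.jpg")
prompt = "Where should the gripper grasp the mug? Answer with a 2D point."
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```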