Xiaomi Open-Sources MiMo-Embodied, the First Cross-Domain Embodied Foundation Model, with SOTA Results on 29 Benchmarks
XIAOMI (HK:01810) · 机器之心 · 2025-11-26 09:19

Core Insights
- The article discusses the development of MiMo-Embodied, a foundation model that integrates autonomous driving and embodied intelligence, marking a significant advance in AI research [5][46].
- The model addresses the fragmentation between autonomous driving and embodied AI, which have traditionally been treated as separate domains and thus lacked a unified cognitive framework [4][9].

Group 1: Model Development and Architecture
- MiMo-Embodied is the first open-source model to successfully merge autonomous driving and embodied intelligence, achieving state-of-the-art (SOTA) results across 17 embodied-intelligence benchmarks and 12 autonomous-driving benchmarks [5][19].
- The model is built on Xiaomi's self-developed MiMo-VL architecture, which decomposes physical interaction into six core dimensions, enhancing both environmental perception and decision-making capabilities [11][12].

Group 2: Training Strategy
- A four-stage progressive training strategy was designed to integrate diverse cross-domain data effectively while avoiding catastrophic forgetting, which is crucial for the model's performance [13][14].
- The training phases are:
  1. Establish foundational knowledge with general and embodied data [14].
  2. Inject autonomous-driving knowledge through mixed supervision while retaining embodied data [14][15].
  3. Enhance logical reasoning capabilities using Chain-of-Thought (CoT) techniques [15].
  4. Refine the model through reinforcement learning (RL) to improve output precision [16].

Group 3: Performance Metrics
- MiMo-Embodied achieved record-breaking performance in key areas such as affordance prediction, task planning, and spatial understanding, demonstrating robust embodied-intelligence capabilities [19][22][25].
- In autonomous-driving benchmarks, the model excelled at environmental perception, state prediction, and driving planning, showing it can generate coherent and contextually appropriate driving decisions [27][28][30].

Group 4: Real-World Applications
- The model's practical utility was validated in embodied navigation and manipulation tasks, where it performed exceptionally well at identifying and locating objects in varied household scenarios [33][34].
- In autonomous-driving trajectory planning, MiMo-Embodied significantly outperformed competing models in both the imitation-learning and reinforcement-learning phases, indicating its effectiveness in complex driving situations [38][39].

Group 5: Conclusion and Future Implications
- The introduction of MiMo-Embodied signals a new phase in embodied-intelligence research, supporting the claim that cognitive logic in the physical world is unified across applications [46].
- This work lays the groundwork for general Vision-Language-Action (VLA) models, moving toward the vision of a single "brain" applicable to diverse embodied forms [46].
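The four-stage progressive strategy in Group 2 can be sketched as a staged data-mixing schedule: each stage samples training batches from a weighted mix of domains, and earlier-domain data is deliberately retained in later stages to guard against catastrophic forgetting. The stage names, domains, and mixing ratios below are illustrative assumptions for the sketch, not values reported for MiMo-Embodied.

```python
import random

# Hypothetical four-stage curriculum; weights are illustrative only.
STAGES = [
    # Stage 1: general + embodied data establish the foundation.
    {"name": "foundation", "mix": {"general": 0.5, "embodied": 0.5}},
    # Stage 2: driving data is injected while embodied data is retained,
    # which is the key to avoiding catastrophic forgetting.
    {"name": "driving", "mix": {"general": 0.2, "embodied": 0.4, "driving": 0.4}},
    # Stage 3: chain-of-thought traces sharpen multi-step reasoning.
    {"name": "cot", "mix": {"embodied_cot": 0.5, "driving_cot": 0.5}},
    # Stage 4: RL rollouts refine output precision.
    {"name": "rl", "mix": {"rl_rollouts": 1.0}},
]

def sample_domain(mix, rng):
    """Pick the data domain for the next batch according to stage weights."""
    domains, weights = zip(*mix.items())
    return rng.choices(domains, weights=weights, k=1)[0]

def run_curriculum(steps_per_stage=1000, seed=0):
    """Count how many batches each domain contributes across all stages."""
    rng = random.Random(seed)
    counts = {}
    for stage in STAGES:
        for _ in range(steps_per_stage):
            domain = sample_domain(stage["mix"], rng)
            counts[domain] = counts.get(domain, 0) + 1
    return counts

counts = run_curriculum()
```

Note that embodied data keeps appearing after driving data is introduced in stage 2; that overlap, rather than a hard switch between domains, is what the article credits with preventing the model from forgetting earlier skills.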