Core Insights
- The article discusses Xiaomi's MiMo-Embodied, a cross-domain foundation model that integrates autonomous driving and embodied intelligence and achieves state-of-the-art (SOTA) performance across 29 benchmarks [5][24].

Group 1: Model Overview
- MiMo-Embodied is the first open-source unified model that combines tasks from autonomous driving and embodied intelligence in a single framework, enabling positive transfer and mutual enhancement between the two domains [7][8].
- The model supports three core capabilities in autonomous driving (environment perception, state prediction, and driving planning) and three core capabilities in embodied intelligence (affordance prediction, task planning, and spatial understanding) [8].

Group 2: Training and Data Strategy
- The model employs a multi-stage training strategy with carefully designed datasets to overcome cross-domain task interference, leading to performance improvements [9][20].
- The training process consists of four stages: general and embodied knowledge learning, autonomous driving knowledge learning, chain-of-thought (CoT) reasoning fine-tuning, and reinforcement learning (RL) fine-tuning [21][27].

Group 3: Performance Metrics
- MiMo-Embodied achieves SOTA in affordance prediction across five benchmarks, outperforming models such as Qwen2.5-VL and GPT-4o [24].
- In task planning, it demonstrates strong long-range reasoning and causal inference capabilities, particularly on the RoboVQA benchmark [24].
- The model excels in spatial understanding and environment perception, leading in nine benchmarks, especially in 3D scene reasoning and spatial language localization [24][25].

Group 4: Comparative Analysis
- The model shows significant improvements over previous models across benchmarks; relative to mixed training approaches, average performance rises by 4% on embodied tasks and 8.1% on autonomous driving tasks [27][37].
- MiMo-Embodied's architecture and training strategy allow it to maintain high performance across both domains, achieving an average score of 62.4% in embodied tasks and 63.3% in autonomous driving tasks [37].
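
The four-stage ordering described under Group 2 can be pictured as a sequential curriculum in which each stage starts from the checkpoint left by the previous one. The sketch below is only an illustration of that ordering, not Xiaomi's training code: the names `STAGES`, `run_curriculum`, and `record_stage` are hypothetical, and the "training" step is a stand-in that just records which stage ran.

```python
from typing import Callable, List

# Stage names are taken from the summary above; everything else is hypothetical.
STAGES: List[str] = [
    "general and embodied knowledge learning",
    "autonomous driving knowledge learning",
    "chain-of-thought (CoT) reasoning fine-tuning",
    "reinforcement learning (RL) fine-tuning",
]


def run_curriculum(checkpoint: list, train_stage: Callable[[list, str], list]) -> list:
    """Run the stages in order, feeding each stage the previous checkpoint."""
    for stage in STAGES:
        checkpoint = train_stage(checkpoint, stage)
    return checkpoint


def record_stage(checkpoint: list, stage: str) -> list:
    """Placeholder for real training: append the stage name to the history."""
    return checkpoint + [stage]


history = run_curriculum([], record_stage)
for step, stage in enumerate(history, start=1):
    print(f"Stage {step}: {stage}")
```

In a real pipeline, `record_stage` would be replaced by a routine that loads the stage's dataset and updates the model weights; the point here is only that the driving stage builds on the embodied stage rather than being mixed with it.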
Xiaomi's MiMo-Embodied: Unifying Autonomous Driving and Embodied Tasks, with 29 SOTA Results!