A Hard-Core Teardown of DM0, the Embodied Large Model from 原力灵机: How Physical AI Is Entering Its Own "Native" Era

Core Insights
- The article discusses the limitations of current large language models (LLMs) and vision-language models (VLMs) in physical robotics, emphasizing the need for a new approach that integrates physical grounding from the outset [1][2]
- The DM0 model, developed by Yuanliang and Jie, is introduced as an embodied-native vision-language-action model that combines various data sources to enhance physical interaction capabilities [3][5]

Model Architecture and Training
- DM0 employs a multi-source mixed training approach and an embodied spatial scaffolding architecture to harmonize heterogeneous data, including internet corpora, autonomous driving logs, and robotic operation trajectories [5][8]
- The model consists of two main components: a VLM backbone for multimodal perception and a flow-matching-based action expert for continuous control [12][13]
- The training pipeline is divided into three stages: pre-training with 1.13 trillion tokens, mid-training with 200 million samples, and post-training with 50 million samples, focusing on aligning the model with specific robotic platforms [16][17][18][19]

Performance Evaluation
- DM0 demonstrated superior performance in the RoboChallenge benchmark, achieving a 62.00% average success rate in single-task evaluations, outperforming larger models like Spirit-v1.5 and GigaBrain-0.1 [24]
- In multi-task evaluations, DM0 achieved a 37.3% average success rate and a task score of 49.08, significantly surpassing the previous best model, pi0.5 [27]

Future Directions
- The authors suggest potential future developments for DM0, including scaling the model to 7B or 30B parameters, integrating multimodal sensory feedback, and enhancing long-term reasoning capabilities [32]
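
To make the flow-matching-based action expert mentioned under Model Architecture and Training more concrete, the following is a minimal sketch of how a flow-matching policy can turn Gaussian noise into a chunk of continuous actions by integrating a velocity field from t = 0 to t = 1. It is not DM0's implementation: the function names, the Euler integrator, the step count, and the chunk shape are all assumptions made for illustration. In a real system the velocity would be predicted by a trained network conditioned on the VLM backbone's features; here a closed-form toy field that flows toward a known target stands in for it so the example runs on its own.

```python
import numpy as np


def toy_velocity(x_t, t, target):
    """Hypothetical stand-in for a learned action expert.

    For the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that carries x_t to `target` at t = 1 is (target - x_t) / (1 - t).
    A trained model would predict this from observations instead of being
    handed the target directly.
    """
    return (target - x_t) / (1.0 - t)


def sample_action_chunk(velocity_fn, target, horizon=4, action_dim=2,
                        num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) with forward Euler, starting from noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # pure noise at t = 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t runs over 0, dt, ..., 1 - dt, so 1 - t is never zero
        x = x + dt * velocity_fn(x, t, target)
    return x


# Illustrative target: the same 2-D action repeated over a 4-step chunk.
target = np.tile(np.array([0.5, -0.25]), (4, 1))
actions = sample_action_chunk(toy_velocity, target)
print(actions)
```

With this toy field the final Euler step lands on the target, so the sampled chunk matches it up to floating-point error. The point of the sketch is the sampling loop: a handful of integration steps produces a whole chunk of continuous actions at once, which is one reason flow matching is a common choice for continuous control heads.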