The EmbodiedBrain Model
Pushing the boundaries of embodied-intelligence task planning and setting new SOTA across multiple embodied-brain benchmarks, ZTE's EmbodiedBrain model teaches the embodied brain "complex planning"
机器之心 (Jiqizhixin) · 2025-12-03 08:30
Core Insights

- The article discusses the development of the EmbodiedBrain model by the ZTE NebulaBrain Team, which aims to address the limitations of current large language models (LLMs) in embodied tasks, focusing on robust spatial perception, efficient task planning, and adaptive execution in real-world environments [2][4].

Group 1: Model Architecture

- EmbodiedBrain uses a modular encoder-decoder architecture based on Qwen2.5-VL, closing the loop from perception through reasoning to action [5].
- The model processes multimodal inputs, including images, video sequences, and complex language instructions, and generates structured outputs for direct control of and interaction with embodied environments [8][10].
- Key components include a vision transformer for image processing, a lightweight MLP for visual-language alignment, and a decoder that strengthens temporal understanding of dynamic scenes [9][10].

Group 2: Data and Training

- The model features a structured data architecture designed for embodied intelligence, ensuring alignment between high-level task goals and low-level execution steps [12].
- Training data spans four core categories: general multimodal instruction data, spatial reasoning data, task planning data, and video understanding data, with quality enforced through multi-stage filtering [14][15].
- Training includes a two-stage rejection-sampling method to strengthen the model's perception and reasoning, followed by a multi-task reinforcement learning approach called Step-GRPO to improve handling of long-sequence tasks [20][21].

Group 3: Evaluation System

- EmbodiedBrain establishes a comprehensive evaluation system covering general multimodal capability, spatial perception, and end-to-end simulation planning, addressing the limitations of traditional offline assessments [26][27].
- The model demonstrates superior performance on benchmarks including MM-IFEval and MMStar, indicating enhanced multimodal capabilities compared to competitors [28][29].
- In spatial reasoning and task planning evaluations, EmbodiedBrain achieves significant improvements, showcasing its ability to perform complex tasks effectively [30][31].

Group 4: Case Studies and Future Outlook

- The model successfully executes tasks involving spatial reasoning and end-to-end execution, demonstrating its capability to generate coherent action sequences from complex instructions [37][41].
- ZTE plans to open-source the EmbodiedBrain model and its training data, aiming to foster collaboration in the field of embodied intelligence and address existing challenges in data accessibility and evaluation standards [42][43].
- Future developments will focus on multi-agent collaboration and on adaptability across real-world robotic platforms, pushing the boundaries of embodied-intelligence applications [43].
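The two-stage rejection sampling mentioned under Data and Training can be sketched in miniature. Everything below is hypothetical scaffolding: `generate_candidates` stands in for sampling from the policy model and `quality_score` for the task-specific verifier, neither of which is detailed in the summary; only the overall sample-then-filter shape is taken from the text.

```python
import random

def generate_candidates(prompt, n, rng):
    # Stand-in for stage 1: sample n candidate plans from the model.
    # (Hypothetical; the real pipeline would decode from the policy model.)
    return [f"{prompt}::plan_{rng.randint(0, 999)}" for _ in range(n)]

def quality_score(candidate):
    # Stand-in for stage 2's verifier; the real system would apply
    # task-specific correctness and quality checks.
    return (hash(candidate) % 100) / 100.0

def rejection_sample(prompts, n_candidates=8, threshold=0.5, seed=0):
    """Sample candidates per prompt, then keep only those whose
    verifier score clears the threshold (rejection sampling)."""
    rng = random.Random(seed)
    kept = []
    for p in prompts:
        for c in generate_candidates(p, n_candidates, rng):
            if quality_score(c) >= threshold:
                kept.append((p, c))
    return kept

data = rejection_sample(["stack the cups", "open the drawer"])
```

The surviving `(prompt, candidate)` pairs would then feed supervised fine-tuning, so that the model only imitates responses its own verifier accepted.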
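Step-GRPO's internals are not given in the summary; standard GRPO, from which the name suggests it derives, scores each sampled response against the mean reward of its own sampling group instead of a learned value function. A minimal sketch of that group-relative advantage (the step-wise extension is an assumption left out here):

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each trajectory's reward by the
    mean and standard deviation of its group, so no critic is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled rollouts for one prompt, with scalar task rewards:
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Rollouts above the group mean get positive advantage and are reinforced; below-mean rollouts are suppressed, which is what lets long-sequence planning improve from outcome rewards alone.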