ZhiYuan Robotics Releases and Open-Sources the First World Model Driven by Robot Action Sequences

Core Viewpoint
- The article highlights ZhiYuan Robotics' breakthroughs in embodied intelligence: EVAC, the world's first embodied world model driven by action sequences, and the evaluation benchmark EWMBench. Both are now open-source and aim to create a new development paradigm of low-cost simulation, standardized evaluation, and efficient iteration [1][2].

Group 1: EVAC Overview
- EVAC is a dynamic world model capable of accurately reproducing complex interactions between robots and their environments, marking a transition from traditional simulation to generative simulation [4].
- Its core capability is a precise mapping from "physical execution" to "pixel space," using a multi-level action condition injection mechanism to achieve end-to-end generation from physical actions to visual dynamics [6].

Group 2: Key Features of EVAC
- High-precision alignment of robot actions and pixels is achieved by projecting the 6D pose of the robotic arms into an action map, ensuring pixel-level alignment for complex dynamic behaviors such as "grasping," "placing," and "colliding" [8].
- EVAC introduces dynamic multi-view modeling by encoding camera motion trajectories as a Ray Map, enabling consistent and coherent visual scene generation from multiple perspectives [8].

Group 3: Generative Simulation Evaluation
- To address the high cost and risk of real-robot evaluation, EVAC proposes a generative simulation evaluation scheme that constructs a complete interactive evaluation pipeline; its results are highly consistent with the success rates measured in real-robot evaluation [10].
- Starting from a small amount of expert trajectory data, EVAC's data augmentation engine uses action interpolation and high-fidelity image generation to raise task success rates by up to 29% [12].

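
The action-to-pixel projection and the action interpolation described above can be illustrated with a minimal sketch. This is not EVAC's implementation: the article does not specify the camera model or the interpolation scheme, so the sketch assumes a simple pinhole camera and linear interpolation, and it handles only the position part of the 6D pose (orientation is omitted). The names `Pinhole`, `lerp_positions`, `project`, and `action_pixels` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Pinhole:
    """Hypothetical pinhole intrinsics in pixels (assumed, not from EVAC)."""
    fx: float
    fy: float
    cx: float
    cy: float


def lerp_positions(start: Vec3, end: Vec3, steps: int) -> List[Vec3]:
    """Return steps + 1 evenly spaced positions from start to end, inclusive.

    Linear interpolation is an assumption; the article only says
    "action interpolation".
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    return [
        tuple(s + (e - s) * i / steps for s, e in zip(start, end))
        for i in range(steps + 1)
    ]


def project(point_cam: Vec3, cam: Pinhole) -> Tuple[float, float]:
    """Project a camera-frame 3D point (z pointing forward) to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (cam.fx * x / z + cam.cx, cam.fy * y / z + cam.cy)


def action_pixels(start: Vec3, end: Vec3, steps: int, cam: Pinhole) -> List[Tuple[float, float]]:
    """Interpolate between two end-effector positions and project each to pixels."""
    return [project(p, cam) for p in lerp_positions(start, end, steps)]


if __name__ == "__main__":
    cam = Pinhole(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    # Two expert waypoints 0.2 m apart along x, both 1 m in front of the camera.
    for u, v in action_pixels((0.0, 0.0, 1.0), (0.2, 0.0, 1.0), 2, cam):
        print(f"pixel = ({u:.1f}, {v:.1f})")
```

In this toy setup the interpolated waypoints land on a horizontal line in the image, which is the kind of pixel-aligned action signal a generative model could be conditioned on. A real action map would presumably also encode orientation and gripper state.
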
Group 4: EWMBench Introduction
- EWMBench is introduced as the world's first evaluation benchmark for embodied world models, aiming to fill an industry gap by establishing a unified and credible evaluation standard [15].
- The evaluation system covers three dimensions: scene consistency, motion correctness, and semantic alignment and diversity, giving a comprehensive analysis of the models under evaluation [17].

Group 5: Performance and Data Support
- EWMBench's evaluation results align more closely with human subjective judgments than those of existing benchmarks, reflecting the actual capabilities of embodied world models in interaction understanding and visual consistency [21].
- The benchmark is built on the AgiBot World dataset, which includes over 300 carefully designed test samples across various robotic tasks, ensuring robust validation of models in complex environments [22].