Driving World Models
Li Auto Proposes the First World Model Incorporating Both Ego and Other-Vehicle Trajectories
理想TOP2· 2025-11-23 11:56
Core Viewpoint
- The article discusses a driving world model developed by Li Auto that integrates both ego and other-vehicle trajectories, enabling more realistic simulation of driving scenarios and supporting reinforcement-learning training of its VLA (Vision-Language-Action) model [1][6].

Group 1: Model Development
- The driving world model proposed by Li Auto addresses three main deficiencies of previous models: lack of interactivity, feature-distribution mismatch, and spatial-mapping difficulties [6].
- The new model, EOT-WM, projects trajectory points into the image coordinate system, generating trajectory videos that unify the visual modalities [6][8].
- A spatiotemporal variational autoencoder (STVAE) encodes the scene and trajectory videos, yielding aligned feature spaces for effective control [7].

Group 2: Technical Innovations
- The model introduces a diffusion Transformer (TiDiT) that injects motion guidance from the trajectory latents into the video latents, improving the denoising of noisy video representations [9].
- A new metric based on the similarity of control latents is proposed to evaluate the controllability of predicted trajectories against the ground-truth trajectory latents [7][9].

Group 3: Contributions
- The model is the first to include both ego and other-vehicle trajectories, enabling more realistic simulation of interactions between the ego vehicle and the driving scene [8].
- It represents trajectories as videos and aligns each trajectory with its corresponding vehicle in a unified visual space [9].
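The trajectory-to-image projection described in Group 1 can be sketched with a standard pinhole camera model. This is a generic illustration, not EOT-WM's actual implementation: the function name, the intrinsics `K`, and the extrinsics `T` are assumptions for the example.

```python
import numpy as np

def project_trajectory(points_world, K, T_world_to_cam):
    """Project 3D trajectory points (N, 3) into pixel coordinates (N, 2)
    using a pinhole model: x_pix ~ K @ [R|t] @ X_world (homogeneous)."""
    n = points_world.shape[0]
    homog = np.hstack([points_world, np.ones((n, 1))])   # (N, 4) homogeneous
    cam = (T_world_to_cam @ homog.T).T                   # (N, 3) camera frame
    cam = cam[cam[:, 2] > 1e-6]                          # keep points in front of camera
    pix = (K @ cam.T).T                                  # (N, 3) image-plane homogeneous
    return pix[:, :2] / pix[:, 2:3]                      # perspective divide

# Example: two points straight ahead of the camera, identity extrinsics.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)[:3]  # (3, 4): world frame == camera frame
traj = np.array([[0.0, 0.0, 5.0], [0.0, 0.0, 10.0]])
print(project_trajectory(traj, K, T))  # both map to the principal point (320, 240)
```

Rendering the projected points frame by frame yields a "trajectory video" that shares the visual modality of the scene video, which is what allows a single encoder to align the two.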
ICCV'25 | HUST Proposes HERMES: The First Unified Driving World Model!
自动驾驶之心· 2025-07-25 10:47
Core Viewpoint
- The article introduces HERMES, a unified driving world model that integrates 3D scene understanding and future scene generation, reducing generation error by 32.4% compared with existing methods [4][17].

Group 1: Model Overview
- HERMES addresses the fragmentation of existing driving world models by combining scene-generation and scene-understanding capabilities in a single framework [3].
- The model uses a BEV (Bird's Eye View) representation to integrate multi-view spatial information and introduces a "world query" mechanism to inject world knowledge into scene generation [3][4].

Group 2: Challenges and Solutions
- HERMES handles multi-view spatiality with a BEV-based world tokenizer that compresses multi-view images into BEV features, preserving key spatial information while respecting token-length limits [5].
- To integrate understanding and generation, HERMES introduces world queries that enrich the generated scenes with world knowledge, bridging the gap between the two tasks [8].

Group 3: Performance Metrics
- HERMES achieves superior performance on the nuScenes and OmniDrive-nuScenes datasets, with an 8.0% improvement in the CIDEr metric on understanding tasks and significantly lower Chamfer distances on generation tasks [4][17].
- The world query mechanism contributes a 10% reduction in Chamfer distance for 3-second point cloud prediction, demonstrating its effectiveness in enhancing generation [20].

Group 4: Experimental Validation
- The experiments use the nuScenes, NuInteract, and OmniDrive-nuScenes datasets, with METEOR, CIDEr, and ROUGE metrics for understanding tasks and Chamfer distance for generation tasks [19].
- Ablation studies confirm the importance of the interaction between understanding and generation: the unified framework outperforms separate training [18].
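The "world query" mechanism in Group 2 can be illustrated as cross-attention in which a small set of learnable query vectors attends over flattened BEV features. This is a minimal single-head sketch of the general idea, not HERMES's actual architecture; the shapes, dimensions, and function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def world_query_attention(queries, bev_feats):
    """Single-head cross-attention: learnable world queries (Q, d) attend
    over flattened BEV features (H*W, d), returning query embeddings
    enriched with scene-level information."""
    d = queries.shape[-1]
    attn = softmax(queries @ bev_feats.T / np.sqrt(d))   # (Q, H*W) attention weights
    return attn @ bev_feats                              # (Q, d) attended features

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 32))     # 8 hypothetical world queries, dim 32
bev = rng.normal(size=(50 * 50, 32))   # 50x50 BEV grid, flattened
out = world_query_attention(queries, bev)
print(out.shape)  # (8, 32)
```

The resulting query embeddings act as a compact carrier of world knowledge that the generation branch can condition on, which is how a mechanism like this bridges understanding and generation.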
Group 5: Qualitative Results
- HERMES accurately generates future point-cloud evolution and understands complex scenes, although challenges remain in scenarios involving complex turns, occlusions, and nighttime conditions [24].
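The Chamfer distance used to score point-cloud generation in Groups 3 and 4 can be computed as below. This is one common symmetric formulation (sum of mean nearest-neighbor distances in both directions); some papers instead use squared distances or average the two terms, so treat the exact form as an assumption rather than HERMES's evaluation code.

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between point sets pred (N, 3) and gt (M, 3):
    mean nearest-neighbor distance from pred to gt plus the reverse direction."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(chamfer_distance(pred, gt))  # 0.0 for identical point sets
```

A lower Chamfer distance means the predicted point cloud lies closer to the ground truth, which is why the 10% reduction reported for the world query mechanism indicates better generation quality.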