ICCV 2025 | HERMES: The First World Model to Unify 3D Scene Understanding and Generation
具身智能之心· 2025-08-16 16:03
Core Viewpoint
- HERMES presents a unified framework for self-driving technology that integrates both understanding and generation tasks, addressing the challenge of accurately predicting future scenarios while comprehensively understanding the current environment [6][10][26].

Group 1: Introduction to HERMES
- HERMES is designed to enhance the capabilities of autonomous vehicles by combining deep environmental understanding with accurate future scene prediction [6][9].
- The framework aims to overcome the traditional separation of understanding and generation tasks in existing models, which limits their effectiveness in real-world driving scenarios [7][10].

Group 2: Methodology of HERMES
- HERMES combines a Driving World Model (DWM) for future scene generation with a Large Language Model (LLM) for scene understanding, creating a synergy between the two [14][12].
- A Bird's-Eye View (BEV) representation is employed to encode high-resolution images efficiently, preserving spatial relationships and semantic details [15].
- A World Queries mechanism is introduced to bridge the gap between understanding and generation, allowing the model to leverage contextual knowledge for better predictions [16].

Group 3: Training and Optimization
- HERMES is trained through a joint optimization process that combines a language modeling loss and a point cloud generation loss, ensuring balanced performance across tasks [18][20].
- The end-to-end training approach allows HERMES to achieve high accuracy in both understanding and generating future scenarios [20].

Group 4: Experimental Results
- HERMES outperforms existing models in both scene understanding and future generation tasks, achieving a 32.4% reduction in future point cloud error compared to similar models [22].
- The model also shows significant gains in natural language generation metrics, with an 8% increase in CIDEr score over dedicated understanding models [22].
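The joint optimization in Group 3 can be sketched as a weighted sum of the two objectives. This is only an illustration: the exact loss forms, the weighting, and all tensor shapes below are assumptions, not the paper's implementation (which may use a different point cloud loss than the simple L1 regression shown here).

```python
import torch
import torch.nn.functional as F

def joint_loss(lm_logits, lm_targets, pred_points, gt_points, pc_weight=1.0):
    """Combine the two HERMES-style objectives: a language-modeling loss
    for the understanding task and a point cloud generation loss for the
    future-scene task. Sketch only; weighting and loss forms are assumed."""
    # Standard next-token cross-entropy over the vocabulary.
    lm_loss = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), lm_targets.view(-1)
    )
    # A plain L1 regression on point coordinates stands in for the
    # generation objective; the actual loss may differ.
    pc_loss = F.l1_loss(pred_points, gt_points)
    return lm_loss + pc_weight * pc_loss

# Toy shapes: 2 sequences of 8 tokens over a 100-word vocabulary,
# and 2 future point clouds of 1024 points each.
logits = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
pred = torch.randn(2, 1024, 3)
gt = torch.randn(2, 1024, 3)
loss = joint_loss(logits, targets, pred, gt)
```

Training both heads through one loss is what lets gradients from generation shape the shared representation used for understanding, and vice versa.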
Group 5: Future Outlook
- HERMES sets a foundation for further exploration of complex perception tasks, aiming towards the development of a general driving model capable of comprehensive physical world understanding [26][27].
ICCV 2025 | HERMES: The First World Model to Unify 3D Scene Understanding and Generation
机器之心· 2025-08-14 04:57
Core Viewpoint
- The article discusses advancements in autonomous driving technology, emphasizing the need for a unified model that integrates understanding of the current environment with effective prediction of future scenarios [7][10][30].

Research Background and Motivation
- Recent progress in autonomous driving requires vehicles to possess a deep understanding of the current environment and accurate predictions of future scenarios to ensure safe and efficient navigation [7].
- The separation of "understanding" and "generation" in mainstream solutions is highlighted as a limitation for effective decision-making in real-world driving scenarios [8][10].

Method: HERMES Unified Framework
- HERMES proposes a unified framework in which a shared large language model (LLM) drives both understanding and generation tasks simultaneously [13][30].
- The framework addresses challenges such as efficiently inputting high-resolution images and integrating world knowledge with predictive capabilities [11][12].

HERMES Core Design
- HERMES employs Bird's-Eye View (BEV) as a unified scene representation, allowing efficient encoding of multiple images while preserving spatial relationships and semantic details [18].
- The introduction of World Queries connects understanding with future prediction, enhancing the model's ability to generate accurate future scenarios [19][20].

Joint Training and Optimization
- HERMES uses a joint training process with two optimization objectives: a language modeling loss for understanding tasks and a point cloud generation loss for accurate future predictions [21][22][23].

Experimental Results and Visualization
- HERMES demonstrates superior performance in scene understanding and future generation tasks on datasets such as nuScenes and OmniDrive-nuScenes [26].
- The model excels at generating coherent future point clouds and accurately describing driving scenes, showcasing its comprehensive capabilities [27].

Summary and Future Outlook
- HERMES presents a new paradigm for autonomous driving world models, effectively bridging the gap between 3D scene understanding and future generation [30].
- The model shows significant improvements in prediction accuracy and understanding tasks compared to traditional models, validating the effectiveness of unified modeling [31].
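One way to picture the World Queries described above is as a small set of learned query vectors that cross-attend to the LLM's hidden states, distilling its world knowledge into a compact conditioning signal for the future-scene decoder. The module below is a minimal sketch under that assumption; the names, sizes, and single-attention-layer design are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class WorldQueryBridge(nn.Module):
    """Hypothetical "world query" bridge: learned queries attend to LLM
    hidden states and return conditioning tokens for a scene decoder."""
    def __init__(self, num_queries=64, dim=256, heads=8):
        super().__init__()
        # A fixed budget of learnable query embeddings.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, llm_hidden):                        # (B, T, dim)
        b = llm_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (B, Q, dim)
        # Queries read from the LLM's states (keys and values).
        out, _ = self.attn(q, llm_hidden, llm_hidden)
        return out                                        # (B, Q, dim)

bridge = WorldQueryBridge()
hidden = torch.randn(2, 128, 256)   # toy LLM hidden states
cond = bridge(hidden)               # compact conditioning for the decoder
```

The appeal of this pattern is that the decoder sees a fixed-length summary regardless of how long the language context is.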
ICCV '25 | HUST Proposes HERMES: The First Unified Driving World Model!
自动驾驶之心· 2025-07-25 10:47
Core Viewpoint
- The article introduces HERMES, a unified driving world model that integrates 3D scene understanding and future scene generation, reducing generation error by 32.4% compared with existing methods [4][17].

Group 1: Model Overview
- HERMES addresses the fragmentation of existing driving world models by combining scene generation and understanding capabilities in a single model [3].
- The model uses a BEV (Bird's-Eye View) representation to integrate multi-view spatial information and introduces a "world query" mechanism to inject world knowledge into scene generation [3][4].

Group 2: Challenges and Solutions
- To handle multi-view spatial input, HERMES employs a BEV-based world tokenizer that compresses multi-view images into BEV features, preserving key spatial information while respecting token length limitations [5].
- To unify understanding and generation, HERMES introduces world queries that enrich generated scenes with world knowledge, bridging the gap between the two tasks [8].

Group 3: Performance Metrics
- HERMES demonstrates superior performance on the nuScenes and OmniDrive-nuScenes datasets, achieving an 8.0% improvement in the CIDEr metric for understanding tasks and significantly lower Chamfer distances in generation tasks [4][17].
- The world query mechanism alone contributes a 10% reduction in Chamfer distance for 3-second point cloud predictions, showing its effectiveness in enhancing generation performance [20].

Group 4: Experimental Validation
- Experiments used the nuScenes, NuInteract, and OmniDrive-nuScenes datasets, with METEOR, CIDEr, and ROUGE as understanding metrics and Chamfer distance as the generation metric [19].
- Ablation studies confirm the importance of the interaction between understanding and generation, with the unified framework outperforming separately trained models [18].
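The BEV world tokenizer idea — compressing several camera views into a short, fixed-length token sequence for the LLM — can be sketched as below. The fusion scheme here (mean over views plus average pooling and a linear projection) is a deliberately naive stand-in; the actual view-to-BEV lifting in HERMES is more sophisticated, and every dimension in this snippet is an assumption.

```python
import torch
import torch.nn as nn

class BEVTokenizer(nn.Module):
    """Sketch of a BEV-style world tokenizer: multi-view image features
    are fused onto a coarse bird's-eye grid, then flattened into a token
    sequence short enough for an LLM context window."""
    def __init__(self, feat_dim=512, bev_size=16, llm_dim=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(bev_size)   # coarsen the grid
        self.proj = nn.Linear(feat_dim, llm_dim)     # match the LLM width

    def forward(self, view_feats):                   # (B, V, C, H, W)
        fused = view_feats.mean(dim=1)               # naive multi-view fusion
        grid = self.pool(fused)                      # (B, C, 16, 16)
        tokens = grid.flatten(2).transpose(1, 2)     # (B, 256, C)
        return self.proj(tokens)                     # (B, 256, llm_dim)

tok = BEVTokenizer()
views = torch.randn(2, 6, 512, 32, 56)   # 6 cameras, toy feature maps
tokens = tok(views)                      # 256 BEV tokens per sample
```

A 16x16 grid yields 256 tokens per frame no matter how many cameras or how large the images, which is the point: the token budget stays fixed while spatial layout is preserved.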
Group 5: Qualitative Results
- HERMES accurately generates future point cloud evolution and understands complex scenes, although challenges remain in scenarios involving complex turns, occlusions, and nighttime conditions [24].
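The Chamfer distance used throughout the generation results above is a standard point cloud metric: the mean nearest-neighbor distance from each set to the other, summed over both directions. The definition below is the common symmetric form; whether HERMES reports squared or unsquared distances is not stated in these summaries.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbor distance a->b plus b->a. O(N*M) brute force,
    fine for illustration; real evaluations use spatial indexes."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(chamfer_distance(a, b))  # 0.0 for identical sets
```

Lower is better, so the reported 32.4% reduction in future point cloud error means the predicted clouds sit markedly closer to the ground-truth LiDAR sweeps.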