ICCV 2025 | HERMES: The First World Model Unifying 3D Scene Understanding and Generation

Core Viewpoint
- HERMES presents a unified framework for self-driving technology that integrates both understanding and generation tasks, addressing the challenges of accurately predicting future scenarios while comprehensively understanding the current environment [6][10][26].

Group 1: Introduction to HERMES
- HERMES is designed to enhance the capabilities of autonomous vehicles by combining deep environmental understanding with accurate future scene predictions [6][9].
- The framework aims to overcome the traditional separation of understanding and generation tasks in existing models, which limits their effectiveness in real-world driving scenarios [7][10].

Group 2: Methodology of HERMES
- HERMES utilizes a Driving World Model (DWM) for future scene generation and a Large Language Model (LLM) for scene understanding, creating a synergy between the two [14][12].
- The Bird's-Eye View (BEV) representation is employed to encode high-resolution images efficiently, preserving spatial relationships and semantic details [15].
- A World Queries mechanism is introduced to bridge the gap between understanding and generation, allowing the model to leverage contextual knowledge for better predictions [16].

Group 3: Training and Optimization
- HERMES is trained through a joint optimization process that includes language modeling loss and point cloud generation loss, ensuring balanced performance across tasks [18][20].
- The end-to-end training approach allows HERMES to achieve a high level of accuracy in both understanding and generating future scenarios [20].

Group 4: Experimental Results
- HERMES outperforms existing models in both scene understanding and future generation tasks, demonstrating a 32.4% reduction in future point cloud error compared to similar models [22].
- The model shows significant improvements in natural language generation metrics, with an 8% increase in CIDEr scores compared to dedicated understanding models [22].
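
To make the percentages in the experimental results easier to read, the sketch below shows how a relative change is computed. It assumes both figures are relative changes against a baseline, which the summary does not state explicitly (the CIDEr gain in particular could also be an absolute difference). The function name and the input values are hypothetical placeholders, not numbers reported for HERMES.

```python
def relative_change_percent(baseline: float, new: float) -> float:
    """Signed relative change of `new` versus `baseline`, in percent.

    Negative values mean a reduction (e.g. lower error),
    positive values mean an increase (e.g. higher score).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Placeholder values for illustration only; NOT results from HERMES.
baseline_error = 1.000   # hypothetical point cloud error of a comparison model
new_error = 0.676        # hypothetical error after the improvement
print(f"{relative_change_percent(baseline_error, new_error):.1f}%")
```

Under this reading, a 32.4% reduction means the remaining error is 67.6% of the baseline error.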
Group 5: Future Outlook
- HERMES sets a foundation for further exploration of complex perception tasks, aiming towards the development of a general driving model capable of comprehensive physical world understanding [26][27].
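
The joint optimization described under Group 3 can be pictured as a weighted sum of a language modeling loss and a point cloud generation loss. The sketch below is a minimal illustration under stated assumptions, not the HERMES implementation: the function names, the mean-absolute-error form of the point cloud term, and the `weight` parameter are all hypothetical, since the summary does not specify how the two losses are defined or balanced.

```python
import math


def language_modeling_loss(token_probs):
    """Mean negative log-likelihood of the reference tokens.

    `token_probs` holds the probability the model assigned to each
    correct next token (each value in (0, 1]).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def point_cloud_loss(predicted, reference):
    """Mean absolute error between predicted and reference values.

    Assumed stand-in for the point cloud generation loss; the actual
    formulation used by HERMES is not given in the summary.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def joint_loss(token_probs, predicted, reference, weight=1.0):
    """Weighted sum of both terms; `weight` is a hypothetical balancing factor."""
    return language_modeling_loss(token_probs) + weight * point_cloud_loss(predicted, reference)


# Toy inputs for illustration only.
loss = joint_loss([0.5, 0.5], [1.0, 3.0], [1.0, 2.0], weight=2.0)
print(loss)
```

Because both terms feed a single scalar, one backward pass in an end-to-end setup would update the understanding and generation components together, which is the balance the summary attributes to joint training.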