SJTU's OmniNWM: An "Omniscient" World Model That Pushes the Limits of 3D Driving Simulation
自动驾驶之心· 2025-10-24 16:03
Core Insights

- The article covers the OmniNWM research, which proposes a panoramic, multi-modal driving navigation world model that surpasses existing state-of-the-art (SOTA) models in generation quality, control precision, and long-horizon stability, setting a new benchmark for simulation training and closed-loop evaluation in autonomous driving [2][58].

Group 1: OmniNWM Features

- OmniNWM integrates state generation, action control, and reward evaluation into a single unified framework, addressing the limitations of existing models that rely on single-modality RGB video and sparse action encoding [10][11].
- A Panoramic Diffusion Transformer (PDiT) jointly generates pixel-aligned outputs across four modalities: RGB, semantics, depth, and 3D occupancy [12][11].
- Action control uses a normalized Plücker ray-map, which provides pixel-level guidance and improves generalization to out-of-distribution (OOD) trajectories [18][22]; a construction sketch is given after this summary.

Group 2: Challenges and Solutions

- The article identifies three core challenges in current autonomous-driving world models: limited state representation, ambiguous action control, and the absence of an integrated reward mechanism [8][10].
- OmniNWM's state generation addresses the first limitation by capturing the full geometric and semantic complexity of real-world driving scenes [10][11].
- The reward is derived from the generated 3D occupancy, yielding a dense, integrated reward function for evaluating driving behavior [35][36]; a minimal reward sketch also follows after this summary.

Group 3: Performance Metrics

- OmniNWM generates long video sequences that remain stable beyond the ground-truth length, demonstrated with rollouts of over 321 frames [31][29]; a generic rollout loop is sketched after this summary.
- The model achieves significant improvements in video generation quality, outperforming existing models on metrics such as FID and FVD [51][52].
- An integrated Vision-Language-Action (VLA) planner strengthens multi-modal scene understanding and outputs high-precision trajectories [43][50].
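The summary states that OmniNWM conditions generation on a normalized Plücker ray-map rather than sparse action tokens. The sketch below shows one common way such a per-pixel ray-map can be built from camera intrinsics and a camera-to-world pose; the `scene_scale` normalization and the exact layout are assumptions for illustration, not the paper's precise formulation.

```python
import numpy as np

def plucker_ray_map(K, c2w, height, width, scene_scale=1.0):
    """Per-pixel Plücker embedding (d, m) for a pinhole camera.

    K           : (3, 3) camera intrinsics
    c2w         : (4, 4) camera-to-world pose
    scene_scale : assumed normalization constant for camera translation
    Returns an (H, W, 6) array holding ray direction d and moment m = o x d.
    """
    # pixel grid at pixel centers, homogeneous coordinates
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)

    # back-project pixels to camera rays, rotate into the world frame
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # camera center, scaled by an assumed scene normalization
    origin = c2w[:3, 3] / scene_scale
    moment = np.cross(origin, dirs_world)                        # m = o x d

    return np.concatenate([dirs_world, moment], axis=-1)         # (H, W, 6)
```

Stacking one such map per frame and per camera gives a dense, pixel-aligned description of the trajectory, which is what makes pixel-level action guidance possible compared with sparse action codes.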
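The reward is described as dense and derived from the generated 3D occupancy, but the summary does not spell out its exact terms. The sketch below only illustrates the collision component of such a rule-based reward; `collision_penalty`, `step_reward`, and the grid layout are assumptions for illustration.

```python
import numpy as np

def occupancy_reward(occ, trajectory, voxel_size, grid_origin,
                     collision_penalty=-1.0, step_reward=0.1):
    """Dense rule-based reward computed from a generated 3D occupancy grid.

    occ         : (X, Y, Z) boolean array, True where a voxel is occupied
    trajectory  : (N, 3) planned waypoints in the same frame as the grid
    voxel_size  : edge length of one voxel (metres)
    grid_origin : world coordinate of voxel (0, 0, 0)
    Returns one reward value per trajectory step.
    """
    rewards = np.full(len(trajectory), step_reward)

    # map each waypoint to a voxel index; ignore points outside the grid
    idx = np.floor((trajectory - grid_origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < occ.shape), axis=1)

    # penalise waypoints that land in occupied voxels
    hits = occ[idx[inside, 0], idx[inside, 1], idx[inside, 2]]
    rewards[np.where(inside)[0][hits]] = collision_penalty
    return rewards
```

Because the occupancy is generated jointly with the other modalities, this kind of reward can be evaluated at every step of a rollout, rather than only at sparse annotated events.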
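The summary only states that OmniNWM stays stable beyond 321 frames; it does not detail how each new frame is conditioned. The loop below is therefore a generic sliding-window autoregressive rollout with a hypothetical `model(frames, ray_map)` interface, shown to illustrate how a video world model is typically extended past its training clip length, not the paper's actual procedure.

```python
import torch

@torch.no_grad()
def rollout(model, context_frames, ray_maps, horizon, window=8):
    """Autoregressively extend a video world model beyond its clip length.

    model          : callable(frames, ray_map) -> next frame (hypothetical)
    context_frames : list of (C, H, W) tensors already observed or generated
    ray_maps       : per-future-frame (6, H, W) Plücker conditioning maps
    horizon        : number of additional frames to generate
    window         : sliding context length fed back into the model
    """
    frames = list(context_frames)
    for t in range(horizon):
        ctx = torch.stack(frames[-window:])      # keep only recent context
        nxt = model(ctx, ray_maps[t])            # predict one more frame
        frames.append(nxt)
    return frames
```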