Core Viewpoint
- The article discusses advances in autonomous driving technology, focusing on the integration of world models to improve system robustness and generalization in long-tail scenarios. It introduces DriveLaW, a unified world model that combines video generation and trajectory planning to address existing challenges in autonomous driving systems [2][5][43].

Group 1: Advancements in Autonomous Driving
- Recent breakthroughs in perception and planning technologies have significantly improved autonomous driving capabilities [2].
- Existing systems still struggle with long-tail scenarios, limiting closed-loop driving performance [2].
- A surge of research is exploring world models that predict future driving scenarios to enhance system robustness and generalization [2][3].

Group 2: World Model Applications
- World models are applied in several ways: synthesizing data for rare scenarios, simulating environments for policy learning, and providing future visual predictions as supervisory signals [3].
- Current world models often lack tight coupling with decision-making, so their contribution to planning remains indirect [3].

Group 3: DriveLaW Overview
- DriveLaW is an end-to-end world model that shifts generation and planning from a parallel structure to a chain structure [5].
- It leverages latent features from a large-scale video generation model to strengthen planning, ensuring consistency between the generated visuals and the planned trajectories [5][10].
- The model has two main components: DriveLaW-Video for video generation and DriveLaW-Act for trajectory planning [10].

Group 4: Performance Metrics
- DriveLaW achieves an FID of 4.6 and an FVD of 81.3, surpassing previous world-model approaches in video generation quality [35].
- On the NAVSIM benchmark, DriveLaW reaches a PDMS of 89.1 without any reinforcement-learning fine-tuning, demonstrating its effectiveness in closed-loop planning [36].

Group 5: Training Strategy
- A three-stage training strategy balances high-fidelity video synthesis with stable trajectory generation [34].
- The first stage learns robust motion patterns at reduced spatial resolution, while the second stage improves visual quality at higher resolution [34].
- The final stage conditions the trajectory planner on latent features from the video generator, effectively coupling generation and planning [34].
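The chain structure described above (generation feeding planning, rather than two parallel heads) can be sketched as follows. This is a minimal toy illustration, not the DriveLaW implementation: the function names, the linear encoders, and all dimensions (128-d observations, 64-d latents, an 8-waypoint horizon) are assumptions chosen only to show the data flow in which the trajectory planner consumes the video generator's latent features.

```python
import numpy as np

rng = np.random.default_rng(0)

def video_world_model(obs, latent_dim=64):
    """Hypothetical stand-in for DriveLaW-Video: encodes observations into
    latent features summarizing the predicted future scene (toy linear map;
    the real component is a large-scale video generation model)."""
    W = rng.standard_normal((obs.shape[-1], latent_dim)) / np.sqrt(obs.shape[-1])
    return obs @ W

def trajectory_planner(latents, horizon=8):
    """Hypothetical stand-in for DriveLaW-Act: maps the generator's latent
    features to a sequence of (x, y) waypoints (toy linear map)."""
    W = rng.standard_normal((latents.shape[-1], horizon * 2)) / np.sqrt(latents.shape[-1])
    return (latents @ W).reshape(-1, horizon, 2)

# Chain structure: planning is conditioned on the generator's latents
# (generation -> planning), not run in parallel from shared inputs.
obs = rng.standard_normal((1, 128))       # flattened sensor features (toy)
latents = video_world_model(obs)          # DriveLaW-Video stage
trajectory = trajectory_planner(latents)  # DriveLaW-Act stage, conditioned on latents
print(trajectory.shape)  # (1, 8, 2): one batch of 8 future waypoints
```

Because the planner's input is the same latent representation that produces the generated video, the planned trajectory and the predicted visuals are tied to one shared prediction of the future, which is the consistency property the summary attributes to the chain design.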
Surpassing DriveVLA-W0! DriveLaW: World-Model Representations Unify Generation and Planning (HUST & Xiaomi)
自动驾驶之心·2026-01-04 01:04