A New End-to-End World Model from Bosch: Unifying Understanding, Planning, and Generation
自动驾驶之心·2026-01-12 03:15

Core Viewpoint

The article presents UniDrive-WM, a unified world model built on a vision-language model (VLM) that integrates scene understanding, trajectory planning, and future image generation within a single architecture, addressing the information bottleneck caused by the separation of perception, prediction, and planning modules in traditional pipelines [1][2][7].

Group 1: Background and Motivation
- Recent advances in multi-modal large language models (MLLMs) have been driven by the strong perception, reasoning, and instruction-following capabilities of vision-language models (VLMs) [4].
- Visual generation has progressed along two complementary paths, autoregressive (AR) token prediction and diffusion-based continuous generation, enabling high-fidelity image synthesis across diverse tasks [4][8].

Group 2: Methodology
- UniDrive-WM places a vision-language model at its core to jointly perform scene understanding, trajectory planning, and future image generation, enabling direct visual reasoning from spatiotemporal observations [19].
- The trajectory planner predicts future trajectories conditioned on the VLM's output, establishing a differentiable connection between the reasoning space and the numerical action space [20].
- Two complementary decoding paradigms for future image prediction are developed: a discrete autoregressive (AR) path and a continuous autoregressive-plus-diffusion (AR+Diffusion) path, revealing their respective advantages and trade-offs in autonomous driving scenarios [19][22].

Group 3: Experimental Results
- On the Bench2Drive benchmark, UniDrive-WM achieved a 5.9% improvement in L2 trajectory error and a 9.2% reduction in collision rate over the previous best methods, validating the benefit of tightly coupling reasoning, planning, and generative world modeling for autonomous driving [2][9].
- The model was evaluated on both open-loop and closed-loop metrics, outperforming traditional end-to-end methods and VLM-guided planning approaches [43][44].

Group 4: Conclusion and Future Work
- UniDrive-WM integrates scene understanding, trajectory planning, and visual generation into a single framework, improving trajectory planning through visual predictions of expected future scenes [54].
- Future work will extend the framework to more interactive and long-horizon driving scenarios, laying the groundwork for the next generation of autonomous driving world models [54].
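The unified design described in Group 2 can be illustrated with a minimal sketch: a shared backbone produces one reasoning embedding, which both a trajectory head (planning in a numerical action space) and an AR token head (discrete future-image generation) read from. All dimensions, weight matrices, and function names below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- assumptions, not values from the paper.
D_EMB = 16       # shared reasoning-embedding size
N_WAYPOINTS = 4  # number of future (x, y) waypoints to plan
VOCAB_IMG = 32   # discrete visual-token vocabulary for AR generation

# Stand-in for the VLM backbone: maps an observation feature vector
# to a shared reasoning embedding.
W_backbone = rng.normal(size=(8, D_EMB))

# Two heads reading the *same* embedding, mirroring the unified design:
W_plan = rng.normal(size=(D_EMB, N_WAYPOINTS * 2))  # trajectory head
W_gen = rng.normal(size=(D_EMB, VOCAB_IMG))         # AR image-token head

def forward(obs: np.ndarray):
    """One step of a toy unified world model.

    obs: (8,) observation feature vector.
    Returns planned waypoints and a greedily decoded next visual token.
    """
    h = np.tanh(obs @ W_backbone)                     # shared reasoning state
    waypoints = (h @ W_plan).reshape(N_WAYPOINTS, 2)  # differentiable plan
    token_logits = h @ W_gen                          # next-token logits
    next_token = int(np.argmax(token_logits))         # greedy AR decode step
    return waypoints, next_token

waypoints, token = forward(rng.normal(size=8))
print(waypoints.shape, token)
```

Because both heads are linear maps of the same embedding, gradients from either the planning loss or the generation loss would flow back into the shared backbone, which is the sense in which reasoning and action are coupled in such an architecture. The continuous AR+Diffusion path would replace the discrete token head with a diffusion decoder conditioned on the same embedding.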