Core Insights
- The article discusses the transition of autonomous driving technology from the "perception-planning" pipeline to an end-to-end Vision-Language-Action (VLA) paradigm, highlighting the significance of world models and generative simulation in this evolution [2][3].

Group 1: Technological Evolution
- The review from Imperial College London systematically analyzes 77 cutting-edge papers published up to September 2025 along three dimensions: end-to-end VLA, world models, and modular integration, providing a comprehensive learning roadmap for developers [2].
- The emergence of VLA marks a shift from simple multi-modal fusion to a collaborative reasoning flow between vision and language that directly outputs planning trajectories [10].
- The article emphasizes the role of world models in leveraging generative AI to address corner cases in autonomous driving [6].

Group 2: Modular Integration
- Despite the popularity of end-to-end architectures, modular solutions are experiencing a resurgence, demonstrating the potential of large models in traditional perception stacks, such as semantic anomaly detection and long-tail object recognition [7].
- The review highlights models such as Talk2BEV and ChatBEV, which use Vision-Language Models (VLMs) to enhance perception capabilities [7].

Group 3: Challenges and Solutions
- The article identifies three major challenges for VLM deployment in autonomous vehicles: reasoning latency, hallucinations, and computational trade-offs [9][13].
- Solutions discussed include visual token compression, chain-of-thought pruning, and optimization strategies for the NVIDIA OrinX chip to reduce latency [12].
- To mitigate hallucinations, techniques such as "hallucination subspace projection" and rule-based safety filters are proposed [15].
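The latency mitigations above can be illustrated with a minimal visual-token-compression sketch: keep only the most salient image-patch tokens before they enter the language backbone, shrinking the sequence the VLM must attend over. This is a generic illustration, not the specific method of any surveyed paper; the saliency scores, keep ratio, and tensor shapes are assumptions.

```python
import numpy as np

def compress_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Retain only the top-scoring fraction of visual tokens.

    tokens: (N, D) array of patch embeddings
    scores: (N,) saliency scores, e.g. [CLS]-attention weights
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(scores)[-n_keep:]  # indices of the top-n_keep scores
    keep_idx.sort()                          # preserve spatial order of kept tokens
    return tokens[keep_idx]

# ViT-style 14x14 patch grid with random stand-in embeddings and scores
patches = np.random.randn(196, 768)
attn = np.random.rand(196)
compressed = compress_visual_tokens(patches, attn, keep_ratio=0.25)
print(compressed.shape)  # (49, 768): a 4x reduction in tokens fed to the LLM
```

With a quarter of the tokens, self-attention cost in the language backbone drops roughly 16x for those layers, which is the kind of trade-off the latency discussion weighs against accuracy loss.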
Group 4: Future Directions
- The review outlines four unresolved challenges in the field: standardized evaluation, edge deployment, multi-modal alignment, and legal and ethical considerations [17].
- It calls for a unified scoring system covering VLA safety and hallucination rates, and stresses the importance of semantic consistency across modalities in complex scenarios [17].

Group 5: Resource Compilation
- The paper includes nine detailed classification tables and a review of key datasets and simulation platforms, such as NuScenes-QA and CARLA, to support community research, and highlights the transition from open-loop metrics to closed-loop evaluation [14][16].
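The shift from open-loop metrics to closed-loop evaluation can be sketched minimally: open-loop scoring compares a predicted trajectory against a logged one, while closed-loop scoring feeds the policy's own actions back through a simulator, so early errors compound rather than being reset by the log. The function names and toy interfaces below are illustrative assumptions, not APIs from the surveyed platforms.

```python
import numpy as np

def open_loop_ade(pred, gt):
    """Open-loop: average displacement error of predicted waypoints
    against the logged ground-truth trajectory."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def closed_loop_rollout(policy, env_step, state, horizon=50):
    """Closed-loop: the policy's own actions drive the simulator state
    forward, so the agent must recover from its own mistakes."""
    states = [state]
    for _ in range(horizon):
        action = policy(states[-1])          # policy sees its own induced state
        state = env_step(states[-1], action)  # simulator advances one tick
        states.append(state)
    return np.array(states)
```

A toy 1-D environment shows the interface: `closed_loop_rollout(lambda s: 1.0, lambda s, a: s + a, 0.0, horizon=3)` yields the states `[0., 1., 2., 3.]`. In a real platform the rollout would be scored on collisions, rule violations, and route completion rather than displacement alone.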
Imperial College VLA Review: From World Models to VLA, How to Restructure Autonomous Driving (T-ITS)
自动驾驶之心 · 2026-01-05 00:35