Latest from Shanghai Jiao Tong University! End-to-End & VLA Survey: A Unified Perspective Under a Generalized Paradigm

Core Viewpoint
- The article traces the evolution of autonomous driving technology and argues that the main paradigms, namely end-to-end (E2E), VLM-centric, and hybrid approaches, should be viewed from a single unified perspective in order to better understand and improve performance in complex driving scenarios [2][4][14].

Group 1: Introduction and Background
- Traditional modular approaches to autonomous driving split the problem into separate tasks, which causes information loss and error accumulation between stages. This has prompted a shift toward data-driven end-to-end architectures [5][10].
- The article introduces a comprehensive review titled "Survey of General End-to-End Autonomous Driving: A Unified Perspective," which aims to bridge the gap in understanding between the different paradigms [3][4].

Group 2: Paradigms of Autonomous Driving
- General End-to-End (GE2E) is defined as any model that maps raw sensor inputs to planning trajectories or control actions, regardless of whether it includes a vision-language model (VLM) [4][14].
- The three main paradigms unified under GE2E are:
  - Conventional End-to-End (Conventional E2E), which relies on structured scene representations for precise trajectory planning [9][17].
  - VLM-centric End-to-End, which uses pre-trained vision-language models to improve generalization and reasoning in complex scenarios [11][33].
  - Hybrid End-to-End, which combines the strengths of the conventional and VLM-centric approaches to balance high-level semantic understanding with low-level control precision [12][39].

Group 3: Performance Comparison
- In open-loop performance tests, the hybrid paradigm outperformed the others, which points to the importance of world knowledge in handling long-tail scenarios [54].
- Conventional E2E methods nevertheless still lead in numerical trajectory prediction accuracy, indicating their robustness in structured environments [54].
- In closed-loop performance, conventional methods remain dominant, particularly in complex driving tasks, while vision-language-action (VLA) methods show potential but need further refinement in fine-grained trajectory control [55][56].

Group 4: Data and Learning Strategies
- Datasets have evolved from geometric annotations to semantically rich annotations, a change that is crucial for training models capable of logical reasoning and of understanding complex traffic contexts [46][48].
- Chain of Thought (CoT) annotations in datasets support advanced reasoning tasks that go beyond simple input-output mappings [47].

Group 5: Model Architecture and Details
- The article provides a detailed comparison of mainstream model architectures, covering their inputs, backbone networks, intermediate tasks, and output forms, to clarify the distinctions among the paradigms [57].
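
To make the taxonomy concrete, the three GE2E paradigms and the comparison axes named above (inputs, backbone, intermediate tasks, output form) can be sketched as a small data structure. This is only an illustrative sketch, not the survey's own code: the `ModelCard` record, its `uses_vlm` and `uses_structured_planner` flags, the `classify` rule, and the example model names are all hypothetical, and treating a model with neither flag as conventional is a simplifying assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Paradigm(Enum):
    """The three paradigms the article groups under General End-to-End (GE2E)."""
    CONVENTIONAL_E2E = "conventional end-to-end"
    VLM_CENTRIC_E2E = "VLM-centric end-to-end"
    HYBRID_E2E = "hybrid end-to-end"


@dataclass
class ModelCard:
    """Hypothetical record using the comparison axes listed in Group 5."""
    name: str
    inputs: List[str]                      # e.g. raw camera or LiDAR streams
    backbone: str                          # backbone network family
    intermediate_tasks: List[str] = field(default_factory=list)
    output_form: str = "trajectory"        # planning trajectory or control action
    uses_vlm: bool = False                 # assumption: flag for a pre-trained VLM
    uses_structured_planner: bool = False  # assumption: flag for a structured-scene planner


def classify(card: ModelCard) -> Paradigm:
    """Assign a paradigm from the two hypothetical flags.

    Simplification: a model with no VLM is treated as conventional E2E.
    """
    if card.uses_vlm and card.uses_structured_planner:
        return Paradigm.HYBRID_E2E
    if card.uses_vlm:
        return Paradigm.VLM_CENTRIC_E2E
    return Paradigm.CONVENTIONAL_E2E


# Made-up example entries, not models from the survey.
cards = [
    ModelCard("planner-only", ["camera"], "bev-encoder",
              intermediate_tasks=["detection", "mapping"],
              uses_structured_planner=True),
    ModelCard("vlm-driver", ["camera"], "vlm", output_form="text trajectory",
              uses_vlm=True),
    ModelCard("dual-system", ["camera", "lidar"], "vlm + bev-encoder",
              uses_vlm=True, uses_structured_planner=True),
]

for card in cards:
    print(f"{card.name}: {classify(card).value}")
```

Keeping the paradigm as a derived property rather than a stored label reflects the article's point that all three are instances of the same sensor-to-plan mapping and differ only in which components they include.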