Core Insights
- The article surveys the integration of vision, language, and action in autonomous driving through Vision-Language-Action (VLA) models, highlighting their potential to enhance the capabilities of self-driving vehicles [1][3].

Evolution of Autonomous Driving Paradigms
- Autonomous driving technology has shifted from modular to integrated approaches, categorized into three core paradigms:
1. End-to-End Autonomous Driving (AD), which directly maps sensor inputs to driving actions but lacks interpretability [3].
2. Vision-Language Models (VLMs for AD), which improve interpretability and generalization but do not directly control vehicle actions [3].
3. Vision-Language-Action Models (VLA for AD), which unify perception, reasoning, and action execution, enabling vehicles to understand complex instructions and make autonomous decisions [3][4].

VLA4AD Architecture
- A typical VLA4AD model consists of three parts: input, processing, and output, integrating environmental perception, high-level instruction understanding, and vehicle control [5].
- The architecture comprises multimodal inputs, core modules for processing visual and language data, and an action decoder that generates control outputs [6][7][9]; a minimal illustrative sketch of this pattern follows at the end of this summary.

Development Stages of VLA Models
- The evolution of VLA models is divided into four stages:
1. Language models as explainers, enhancing interpretability without direct control [16].
2. Modular VLA models, in which language actively contributes to planning decisions [19].
3. Unified end-to-end VLA models that map sensor inputs to control signals in a single forward pass [20].
4. Reasoning-augmented VLA models that incorporate long-term reasoning and memory into decision-making [21].

Representative VLA4AD Models
- The article compares VLA4AD models in detail by their inputs, outputs, datasets, and core contributions [23]. Examples include:
- DriveGPT-4, which uses a single image input to generate high-level control labels [22].
- ADriver-I, which integrates vision-action tokens for control [22].
- RAG-Driver, which employs retrieval-augmented control mechanisms [22].

Datasets and Benchmarks
- High-quality, diverse datasets are crucial for VLA4AD development; notable datasets include BDD100K, nuScenes, and Bench2Drive, which provide rich annotations for training and evaluation [25][26][29]. A brief data-loading sketch also appears after this summary.

Challenges and Future Directions
- The article outlines six major challenges facing VLA4AD, including robustness, real-time performance, data bottlenecks, and multimodal alignment [31][32].
- Future directions include foundation-scale driving models, neuro-symbolic safety kernels, fleet-scale continual learning, a standardized traffic language, and cross-modal social intelligence [36][37].
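The survey describes the input → processing → output pattern at a high level rather than a specific implementation; the PyTorch sketch below is only a minimal illustration of that pattern (vision encoder, instruction encoder, fusion, action decoder). All class names, layer sizes, and the two-dimensional control output are assumptions made for this example, not details from the surveyed models.

```python
# Minimal, illustrative sketch of the VLA4AD pattern summarized above:
# multimodal inputs -> vision/language encoders -> fusion -> action decoder.
# Module names and dimensions are assumptions, not from any surveyed model.
import torch
import torch.nn as nn


class ToyVLADriver(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_actions=2):
        super().__init__()
        # Vision branch: a small CNN stands in for the camera encoder.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Language branch: token embeddings + mean pooling stand in for an LLM.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Fusion: concatenate the two modality features and mix them.
        self.fusion = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())
        # Action decoder: regress low-level controls (e.g. steering, acceleration).
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, images, instruction_ids):
        v = self.vision_encoder(images)                    # (B, d_model)
        t = self.text_embed(instruction_ids).mean(dim=1)   # (B, d_model)
        fused = self.fusion(torch.cat([v, t], dim=-1))
        return self.action_head(fused)                     # (B, n_actions)


if __name__ == "__main__":
    model = ToyVLADriver()
    imgs = torch.randn(2, 3, 224, 224)        # front-camera frames
    instr = torch.randint(0, 30522, (2, 12))  # tokenized driving instruction
    print(model(imgs, instr).shape)           # torch.Size([2, 2])
```

In a real system the vision branch would be a pretrained backbone, the language branch a large language model, and the action decoder would emit either trajectories or control signals, matching the output types the survey's taxonomy distinguishes.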
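For the datasets listed above, nuScenes ships with an official devkit; the snippet below is a brief loading sketch, assuming the `nuscenes-devkit` package is installed and the v1.0-mini split sits at a hypothetical `/data/nuscenes` path. It only shows how keyframe samples and their front-camera annotations are accessed.

```python
# Brief sketch of iterating nuScenes keyframes with the official nuscenes-devkit.
# The dataroot path and the v1.0-mini split are assumptions for illustration.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version="v1.0-mini", dataroot="/data/nuscenes", verbose=True)

# Each sample is a keyframe with synchronized sensor data and 3D annotations.
for sample in nusc.sample[:5]:
    cam_token = sample["data"]["CAM_FRONT"]
    cam_data = nusc.get("sample_data", cam_token)
    boxes = nusc.get_boxes(cam_token)  # 3D box annotations for this keyframe
    print(cam_data["filename"], len(boxes), "annotated objects")
```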
World's first survey of VLA for autonomous driving released: a complete breakdown of VLA self-driving models
具身智能之心·2025-07-03 08:22