Taking Li Auto as an Example: Tracing the Evolution of the Autonomous Driving "Brain" - An Analysis of the VLA Architecture

Core Viewpoint
- The article discusses the evolution of autonomous driving paradigms, emphasizing the transition from traditional End-to-End (E2E) models to the emerging Vision-Language-Action (VLA) model, which aims to integrate perception, reasoning, and action into a unified system and thereby address the limitations of previous models [1][6][45].

Group 1: Evolution of Autonomous Driving Models
- Traditional End-to-End (E2E) models, also known as Vision-Action (VA) models, are criticized for their "black box" nature: they lack explainability, which leads to trust issues [3][8].
- Vision-Language Models (VLMs) emerged to provide explanations but created an "action gap", since they could only interpret data, not execute actions [3][6].
- The VLA model represents a revolutionary shift, combining computer vision, natural language processing, and reinforcement learning into a single, explainable system capable of both understanding and acting [6][29].

Group 2: Characteristics of VLA
- A true E2E system must be a unified neural network that processes raw sensor inputs and outputs executable control signals, with full differentiability so that learning is effective [8][9].
- VLA addresses the shortcomings of VLMs with a fully differentiable architecture that lets error signals backpropagate seamlessly from actions back to the sensory inputs (a minimal sketch of this property follows this summary) [27][28].
- The VLA model eliminates the inefficiencies of "fast-slow dual-core" systems by integrating perception, reasoning, and action into a single model, which strengthens data-driven learning and iteration [25][29].

Group 3: Challenges in Autonomous Driving
- Autonomous driving faces significant challenges from "long-tail scenarios": rare, unexpected situations that traditional models struggle to handle [32][34].
- While VLMs address some of these challenges, they introduce a new "semantic gap": their textual output does not translate directly into actionable control signals [36][39].
- VLA aims to resolve these issues with a unified framework that can manage complex driving scenarios while ensuring high precision in action execution (see the de-tokenization sketch at the end of this summary) [45].

Group 4: Technical Components of VLA
- VLA consists of three core components, each playing a crucial role in the system's functionality: a visual encoder (V), a language encoder (L), and an action decoder (A) [46][48].
- The visual encoder, primarily ViT and its variants, translates raw sensor data into visual tokens that the language model can understand [48][50].
- The language encoder integrates visual and textual information, performing complex cross-modal reasoning to generate action tokens for the action decoder [62][71].
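The fully differentiable V-L-A pipeline described in Groups 2 and 4 can be made concrete with a minimal sketch. The PyTorch code below is a hypothetical toy model, not Li Auto's implementation: the module sizes, the fusion strategy (simple token concatenation), and the two-dimensional control output are all assumptions made for illustration. What it does demonstrate faithfully is the property the article emphasizes: a loss computed on the action output backpropagates through the language encoder and into the visual encoder in a single computation graph.

```python
# Minimal, hypothetical V-L-A sketch (not Li Auto's actual architecture).
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """ViT-style patch embedding: raw camera frames -> visual tokens."""

    def __init__(self, patch=16, dim=256):
        super().__init__()
        # Conv2d with stride == kernel size splits the image into patches
        # and projects each patch to a `dim`-dimensional token.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images):                      # (B, 3, H, W)
        tokens = self.patchify(images)              # (B, dim, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)    # (B, N_vis, dim)


class LanguageEncoder(nn.Module):
    """Transformer that fuses visual tokens with instruction tokens."""

    def __init__(self, vocab=1000, dim=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, visual_tokens, text_ids):     # text_ids: (B, N_txt)
        fused = torch.cat([visual_tokens, self.embed(text_ids)], dim=1)
        return self.encoder(fused)                  # (B, N_vis + N_txt, dim)


class ActionDecoder(nn.Module):
    """Maps the fused representation to continuous control signals."""

    def __init__(self, dim=256, n_controls=2):      # e.g. steering, acceleration
        super().__init__()
        self.head = nn.Linear(dim, n_controls)

    def forward(self, fused):
        return self.head(fused.mean(dim=1))         # (B, n_controls)


class VLA(nn.Module):
    def __init__(self):
        super().__init__()
        self.v, self.l, self.a = VisualEncoder(), LanguageEncoder(), ActionDecoder()

    def forward(self, images, text_ids):
        return self.a(self.l(self.v(images), text_ids))


model = VLA()
images = torch.randn(4, 3, 224, 224)                # dummy camera batch
text_ids = torch.randint(0, 1000, (4, 8))           # dummy instruction tokens
target = torch.randn(4, 2)                          # dummy control targets

loss = nn.functional.mse_loss(model(images, text_ids), target)
loss.backward()
# Gradients reach the very first layer: the action loss has flowed back
# through L and into V, which is what "fully differentiable" means here.
print(model.v.patchify.weight.grad is not None)     # True
```

Because every stage is a differentiable module in one graph, the same optimizer step that corrects the action also tunes how the scene is perceived; this is the learning behavior the article contrasts with the hand-offs of the "fast-slow dual-core" design.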

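The "semantic gap" that the action decoder closes can also be illustrated. One approach used in published VLA work (e.g., RT-2-style action binning) has the language model emit discrete action tokens that are deterministically mapped back to continuous control values. The sketch below assumes a 256-bin uniform quantization and illustrative control ranges; the article does not specify Li Auto's actual scheme.

```python
# Hypothetical action de-tokenization sketch (bin count and ranges are assumed).
N_BINS = 256                      # bins per control dimension
RANGES = {                        # assumed physical range per control dimension
    "steering": (-1.0, 1.0),      # normalized steering angle
    "acceleration": (-3.0, 3.0),  # m/s^2
}

def detokenize(token_id: int, control: str) -> float:
    """Map a discrete action-token id (0..N_BINS-1) to the bin-center value."""
    lo, hi = RANGES[control]
    return lo + (hi - lo) * (token_id + 0.5) / N_BINS

def tokenize(value: float, control: str) -> int:
    """Inverse mapping, used when building discrete training targets."""
    lo, hi = RANGES[control]
    idx = int((value - lo) / (hi - lo) * N_BINS)
    return max(0, min(idx, N_BINS - 1))

# The language model emits token ids; de-tokenization yields control values
# the vehicle can execute directly, with no free-text parsing in between.
print(detokenize(tokenize(0.25, "steering"), "steering"))  # ~0.2539
```

Because the mapping is fixed and lossless up to bin width, the model's output needs no natural-language interpretation before it can drive an actuator, which is precisely how a VLA avoids the VLM's semantic gap.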