Core Viewpoints
- The latest evolution in intelligent driving technology is the integration of multimodal large models, particularly the Vision-Language-Action (VLA) model, which is seen as the next generation of "end-to-end" solutions [1][2]
- By combining end-to-end capabilities with multimodal large models, VLA models are expected to significantly enhance the understanding and generalization abilities of intelligent driving systems, potentially serving as a key bridge from L2 to L4 autonomous driving [2][4]
- Companies such as Waymo and Li Auto are already exploring VLA models, with Li Auto initiating pre-research on L4 autonomous driving and integrating VLA models with cloud-based world models [3][4]

Industry Trends
- The intelligent driving industry is rapidly shifting from rule-based algorithms to AI-driven "end-to-end" solutions, which offer higher performance ceilings and adapt better to complex urban traffic scenarios [2]
- Integrating large language models (LLMs) and vision-language models (VLMs) with end-to-end systems is becoming a trend, with companies like Li Auto adopting end-to-end + VLM solutions [4]
- VLA models represent a more integrated approach, in which multimodal large models are no longer external add-ons but intrinsic capabilities of the end-to-end system [4]

Technological Advancements
- VLA models, originally developed in the robotics industry, are now being applied to intelligent driving, offering enhanced scene reasoning and generalization capabilities [1][2]
- The VLA model is considered a 2.0 version of end-to-end systems, handling complex traffic rules, tidal lanes, and long-sequence reasoning better than previous models [5]
- Traditional rule-based systems can reason only about 1 second of road conditions ahead; end-to-end 1.0 systems extend this to roughly 7 seconds, while VLA models can reason over several tens of seconds [5]

Challenges and Barriers
- The deployment of VLA models faces significant challenges, particularly in hardware: current vehicle chips lack the computational power needed to support these models [6]
- NVIDIA's next-generation AI chip, Thor, with single-chip AI compute of 1,000 TOPS, is expected to address some of these challenges, but its release may be delayed, and cost remains a concern [6]
- Integrating end-to-end systems with multimodal large models requires advanced model-framework definition and rapid iteration capabilities, which are not yet widely available in the industry [7]

Company Initiatives
- Li Auto has started pre-research on L4 autonomous driving and is developing a VLA model combined with a cloud-based world model [3]
- Yuanrong Qixing (DeepRoute.ai), after receiving a 700 million RMB investment from Great Wall Motors, plans to develop a VLA model based on NVIDIA's latest Thor chip, with the model expected to launch in 2025 [4]
- Waymo has introduced EMMA, an end-to-end autonomous driving multimodal model considered a VLA architecture, integrating vision, language, and action capabilities [2]
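The idea of a VLA model, and the reasoning-horizon comparison above, can be sketched in code. The following is a minimal, hypothetical illustration only: the class, its interface, and the toy "fusion" rule are assumptions for clarity, not any company's actual model or API. It shows the shape of the problem a VLA system solves: map camera features plus a language-level instruction to an action trajectory spanning tens of seconds, versus the ~1 s horizon of rule-based stacks and ~7 s of end-to-end 1.0 systems.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a VLA-style planning interface.
# All names, shapes, and the "fusion" rule below are illustrative assumptions.

@dataclass
class Waypoint:
    x: float  # metres ahead of the ego vehicle
    y: float  # lateral offset in metres
    t: float  # seconds into the future

class ToyVLAPolicy:
    """Maps (vision features, language instruction) -> a multi-second trajectory."""

    def __init__(self, horizon_s: float = 30.0, dt: float = 1.0):
        # VLA models are described as reasoning over tens of seconds,
        # versus ~1 s (rule-based) and ~7 s (end-to-end 1.0).
        self.horizon_s = horizon_s
        self.dt = dt

    def plan(self, vision_features: List[float], instruction: str) -> List[Waypoint]:
        # Placeholder "multimodal fusion": a real VLA model would encode both
        # modalities with a transformer and decode actions, not use a keyword rule.
        speed = 5.0 if "slow" in instruction else 10.0  # m/s, toy heuristic
        n_steps = int(self.horizon_s / self.dt)
        return [Waypoint(x=speed * self.dt * (i + 1), y=0.0, t=self.dt * (i + 1))
                for i in range(n_steps)]

policy = ToyVLAPolicy()
traj = policy.plan(vision_features=[0.1, 0.2], instruction="proceed, slow for tidal lane")
print(len(traj), traj[-1].t)  # 30 waypoints covering a 30 s horizon
```

The point of the sketch is the interface, not the logic: the language input conditions the whole trajectory, which is what distinguishes a VLA model from an end-to-end system with a bolted-on VLM advisor.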
Explaining VLA, the Next-Generation "End-to-End" Model: A Key Springboard Toward Autonomous Driving
Source: 36氪 (36Kr) · 2024-11-11 00:39