Does Autonomous Driving Necessarily Need Language Models?
自动驾驶之心 (Autonomous Driving Heart) · 2025-11-05 00:04

Core Viewpoint

The article examines the technological competition between two architectures for autonomous driving: WEWA (World Engine + World Action Model), represented by Huawei, and VLA (Vision-Language-Action), pursued by companies such as Li Auto and Xpeng. It centers on the debate over whether large language models (LLMs) are essential for autonomous driving, emphasizing the trade-off between efficiency and cognitive depth in technology choices [2][4].

Summary by Sections

1. Technological Divergence: WEWA vs. VLA
The year 2025 is identified as a critical turning point for autonomous driving technology, with WEWA and VLA representing opposing approaches. WEWA aims for efficient deployment through "de-linguistic" methods, while VLA pursues cognitive intelligence via language models [2][4].

2. Fundamental Differences Between WEWA and VLA
The two architectures differ fundamentally in their information-processing logic, core components, and technical goals, particularly in the role of language as an intermediary. WEWA maps visual data directly to actions, while VLA uses a three-stage process: visual features, then language semantics, then control instructions [5][6].

3. The Cost of Language Models
VLA's reliance on large language models incurs significant computational cost, which is the core bottleneck for mass production. Hardware costs escalate sharply because training requires high-performance GPU clusters and real-time inference requires advanced in-vehicle chips [7][8][9].

4. The Advantages of Language Models
Despite the high computational cost, VLA's rise is attributed to the abstraction and cognitive capabilities of language models, which can compress many similar scenarios into concise language, improving decision-making in complex situations [10][12][13][14].

5. Core Trade-off: Efficiency vs. Intelligence
Whether language models are necessary for autonomous driving remains unsettled. In near-term production scenarios (L2-L3), WEWA's efficiency and low latency are more valuable; in long-term high-level autonomy (L4-L5), VLA's cognitive advantages become essential. The future may see a hybrid approach that combines both architectures to balance efficiency and intelligence [15][16][17][18].
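The structural contrast the article draws, WEWA's direct visual-to-action mapping versus VLA's three-stage vision, language, action pipeline, can be sketched in code. Neither architecture is public, so every function and rule below is a hypothetical stub chosen only to make the one-hop versus three-hop shape visible:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    steering: float  # normalized steering command in [-1, 1]
    throttle: float  # normalized throttle command in [0, 1]


# --- WEWA-style: one learned mapping, no language intermediary ---
# Illustrative stub only; the real "de-linguistic" model is proprietary.
def wewa_policy(visual_features: List[float]) -> Action:
    steering = max(-1.0, min(1.0, visual_features[0]))
    throttle = max(0.0, min(1.0, visual_features[1]))
    return Action(steering, throttle)


# --- VLA-style: visual features -> language semantics -> control ---
def vla_perceive(visual_features: List[float]) -> str:
    # Stage 1 -> 2: compress the scene into a language-level description.
    return "pedestrian ahead" if visual_features[0] > 0.5 else "clear road"


def vla_decide(scene_description: str) -> str:
    # Stage 2: an LLM would reason over the description; stubbed as a rule.
    return "slow down" if "pedestrian" in scene_description else "maintain speed"


def vla_act(decision: str) -> Action:
    # Stage 3: translate the semantic decision into control instructions.
    return Action(0.0, 0.1) if decision == "slow down" else Action(0.0, 0.6)


def vla_policy(visual_features: List[float]) -> Action:
    return vla_act(vla_decide(vla_perceive(visual_features)))
```

The extra hops are where VLA's cost and its cognitive leverage both live: each stage adds latency and compute, but the language layer is also where many similar scenarios get compressed into one reusable description.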
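The hybrid approach the article speculates about could take the form of a dispatcher: routine frames go through a fast WEWA-like path, and only unfamiliar or complex frames escalate to a slow VLA-like reasoning path. This is a minimal sketch of that idea; the novelty score, threshold, and latency figures are illustrative assumptions, not measurements of any real system:

```python
from typing import List, Tuple


def fast_policy(features: List[float]) -> Tuple[str, float]:
    # WEWA-like direct mapping: low latency, suited to routine L2-L3 driving.
    return ("direct", 0.005)  # (path taken, assumed latency in seconds)


def slow_policy(features: List[float]) -> Tuple[str, float]:
    # VLA-like language-model reasoning: far more compute, deeper cognition.
    return ("language", 0.500)


def hybrid_step(features: List[float], novelty: float,
                threshold: float = 0.8) -> Tuple[str, float]:
    # Escalate to the cognitive path only when the scene looks unfamiliar;
    # the 0.8 threshold is an arbitrary placeholder for this sketch.
    if novelty > threshold:
        return slow_policy(features)
    return fast_policy(features)
```

The design intent is that average latency stays near the fast path while rare long-tail scenes still benefit from language-level reasoning, which is the efficiency-versus-intelligence balance the article describes.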