Core Viewpoint
- The article examines the core obstacles to deploying VLA (Vision-Language-Action) systems, arguing that existing approaches treat language-action alignment as a defect to be patched rather than a structural problem to be eliminated from the architecture [1].

Group 1: Shared Codebook
- LinkVLA proposes a Shared Codebook that removes the need to translate between human language and vehicle action coordinates, eliminating the information loss inherent in translation without direct supervision [2][3].
- By discretizing continuous trajectory coordinates into action tokens and merging them with language tokens, LinkVLA builds a unified representation that closes the modal gap at the structural level [3].

Group 2: Action Understanding Objective
- LinkVLA introduces an Action Understanding Objective: the model must not only generate trajectories from language commands but also reverse-engineer language descriptions from existing trajectories, improving its reliability [4].
- This dual-task approach yields significant gains: the average success rate rises from 81.63% to 87.16%, and the lane-change success rate from 88.49% to 97.42% [4].

Group 3: C2F Architecture
- The Coarse-to-Fine (C2F) architecture in LinkVLA cuts inference time from 361 ms to 48 ms by compressing the serial dependency of trajectory generation into two steps, enabling real-time performance [5][6].
- The speedup does not cost accuracy: the driving score rises from 90.66 to 91.01, demonstrating a simultaneous gain in speed and precision [6].

Group 4: Systematic Reconstruction
- Together, the Shared Codebook, Action Understanding, and C2F contributions amount to a systematic reconstruction of the underlying architecture of language-action models, rather than mere local optimizations [7].
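The shared-codebook idea above can be sketched minimally: continuous coordinates are quantized into bins, and each bin becomes a token id appended after the language vocabulary, so language and action share one token space. The binning scheme, vocabulary size, and coordinate range below are illustrative assumptions, not details from the article.

```python
LANG_VOCAB_SIZE = 32000       # hypothetical size of the language vocabulary
NUM_BINS = 256                # hypothetical number of bins per coordinate
COORD_RANGE = (-50.0, 50.0)   # hypothetical ego-frame range in meters

def coord_to_token(x: float) -> int:
    """Map a continuous trajectory coordinate to a discrete action token id.

    Action token ids start right after the language vocabulary, so both
    modalities are served by a single embedding table.
    """
    lo, hi = COORD_RANGE
    x = min(max(x, lo), hi)  # clamp to the modeled range
    bin_idx = int((x - lo) / (hi - lo) * (NUM_BINS - 1) + 0.5)  # round to bin
    return LANG_VOCAB_SIZE + bin_idx

def token_to_coord(token: int) -> float:
    """Invert the mapping, up to quantization error (~half a bin width)."""
    lo, hi = COORD_RANGE
    bin_idx = token - LANG_VOCAB_SIZE
    return lo + bin_idx / (NUM_BINS - 1) * (hi - lo)
```

With 256 bins over a 100 m range, the quantization error is at most about 0.2 m per coordinate; a real system would tune bin count and range to the required precision.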
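The dual objective in Group 2 can be pictured as building two training samples from each (command, trajectory) pair: a generation sample (language to action tokens) and an understanding sample (action tokens back to language). The sample format and the `<act_*>` token naming below are assumptions for illustration; the article does not specify LinkVLA's data layout.

```python
def make_dual_samples(command: str, action_tokens: list[int]) -> list[dict]:
    """Build one generation and one understanding sample from a single pair.

    The understanding task forces the model to explain a trajectory in
    language, not just emit trajectories from language.
    """
    acts = " ".join(f"<act_{t}>" for t in action_tokens)  # hypothetical format
    return [
        # forward task: generate a trajectory from the command
        {"input": command, "target": acts, "task": "generation"},
        # inverse task: reverse-engineer a description from the trajectory
        {"input": acts, "target": command, "task": "understanding"},
    ]
```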
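The C2F speedup comes from replacing one-waypoint-at-a-time serial decoding with two passes: a coarse pass that emits a sparse path in parallel, and a refinement pass that densifies it. The stub functions below stand in for model calls and use a straight path with linear interpolation; they are a sketch of the two-step structure, not LinkVLA's actual networks.

```python
def predict_coarse(scene: dict) -> list[tuple[float, float]]:
    # Step 1: one forward pass emits a few coarse waypoints in parallel.
    # Stub in place of the model: a straight 8-point path along x.
    return [(float(i), 0.0) for i in range(0, 16, 2)]

def refine(coarse: list[tuple[float, float]]) -> list[tuple[float, float]]:
    # Step 2: a second pass densifies the path; here, midpoint interpolation
    # stands in for the learned refinement.
    fine = []
    for (x0, y0), (x1, y1) in zip(coarse, coarse[1:]):
        fine.append((x0, y0))
        fine.append(((x0 + x1) / 2, (y0 + y1) / 2))
    fine.append(coarse[-1])
    return fine

# Two model passes total, regardless of trajectory length -- versus one
# pass per waypoint under serial autoregressive decoding.
trajectory = refine(predict_coarse({}))
```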
No clickbait: Li Auto systematically reconstructs the language-action model