NAVSIM SOTA! LatentVLA: Building an Efficient Autonomous-Driving VLA via Latent Action Prediction (OpenDriveLab & Li Auto)
自动驾驶之心·2026-01-12 09:20

Core Insights
- The article introduces LatentVLA, a framework that integrates Vision-Language Models (VLMs) with traditional end-to-end methods for autonomous driving, achieving state-of-the-art trajectory-prediction performance [2][31][52].

Group 1: Background and Challenges
- Recent end-to-end autonomous driving methods perform impressively when trained on large human driving datasets, but they still face fundamental challenges because the diversity of the training data is limited relative to real-world traffic conditions [4][10].
- Key challenges identified:
  1. Insensitivity and imprecision in trajectory prediction, caused by the discrete nature of language-model outputs [5].
  2. The burden of data annotation, plus a language bias that limits how well implicit driving knowledge can be captured [5].
  3. Low computational efficiency and cognitive misalignment in VLMs, which often rely on time-consuming multi-step reasoning [5][6].

Group 2: LatentVLA Framework
- LatentVLA proposes a self-supervised latent action prediction approach that lets VLMs learn rich driving representations from unannotated trajectory data, alleviating language bias and reducing annotation cost [21][22].
- The framework uses knowledge distillation to transfer the learned representations and reasoning ability from the VLM to a traditional end-to-end trajectory-prediction network, preserving computational efficiency and numerical accuracy [21][22].

Group 3: Performance and Results
- LatentVLA achieved a PDMS of 92.4 on the NAVSIM benchmark, a new state of the art, and demonstrated strong zero-shot generalization on the nuScenes benchmark [31][41].
- Integrating VLM features significantly improved performance over baseline methods, with notable gains in trajectory-planning accuracy [41][42].
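To make the two-stage idea above concrete, here is a minimal numpy sketch of a combined imitation-plus-distillation objective: the student trajectory network is supervised by human driving waypoints while its features are pulled toward a frozen VLM teacher. All shapes, variable names, and the weight `lam` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 8 future waypoints, each (x, y); 256-dim features.
gt_traj      = rng.normal(size=(8, 2))                   # human driving trajectory (supervision)
pred_traj    = gt_traj + 0.1 * rng.normal(size=(8, 2))   # student network output (toy)
teacher_feat = rng.normal(size=(256,))                   # frozen VLM representation (toy)
student_feat = teacher_feat + 0.05 * rng.normal(size=(256,))  # student encoder output (toy)

# Imitation term: mean L2 error between predicted and ground-truth waypoints.
imitation_loss = np.mean(np.linalg.norm(pred_traj - gt_traj, axis=-1))

# Distillation term: match the student's features to the frozen VLM teacher.
distill_loss = np.mean((student_feat - teacher_feat) ** 2)

# Total objective; lam is a hypothetical trade-off hyperparameter.
lam = 0.5
total_loss = imitation_loss + lam * distill_loss
```

Because the teacher is only needed at training time, inference runs the lightweight student alone, which is what keeps the distilled model's latency low.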
Group 4: Experimental Analysis
- A comprehensive analysis of the experimental results shows that the distilled version of LatentVLA maintains competitive performance while significantly reducing inference latency, raising the frame rate from 1.27 FPS to 4.82 FPS [52].
- Zero-shot performance on nuScenes was competitive, with an average L2 error of 0.33 m, indicating strong cross-dataset generalization [44][45].

Group 5: Conclusion
- LatentVLA addresses three critical challenges of VLMs for autonomous driving: insensitivity in trajectory prediction, reliance on language annotations, and low computational efficiency, offering a promising paradigm for leveraging pre-trained VLMs in real-world autonomous driving applications [52].
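The metrics quoted above are mechanical to compute; the sketch below shows a standard average-L2 trajectory error and the FPS-to-latency conversion behind the reported speedup. The helper `avg_l2_error` and its toy inputs are hypothetical, not the benchmark's official evaluation code.

```python
import numpy as np

def avg_l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance (metres) between predicted and ground-truth waypoints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

# Latency implied by the reported frame rates.
fps_vlm, fps_distilled = 1.27, 4.82
latency_vlm = 1.0 / fps_vlm              # ~0.79 s per frame
latency_distilled = 1.0 / fps_distilled  # ~0.21 s per frame
speedup = fps_distilled / fps_vlm        # ~3.8x faster after distillation
```

For example, `avg_l2_error(np.zeros((4, 2)), np.ones((4, 2)))` is the distance from the origin to (1, 1), i.e. sqrt(2) ≈ 1.41 m.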
