XPeng and Peking University Propose a New Visual Token Pruning Framework; He Xiaopeng: Another New Breakthrough on the Road to L4

Core Insights
- The paper "FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning" has been accepted at AAAI 2026, presenting a visual token pruning framework designed specifically for end-to-end autonomous driving VLA models [1][8]
- FastDriveVLA introduces a plug-and-play visual token pruner, ReconPruner, which can be dropped into an autonomous driving VLA model at inference time without retraining the entire model [1][8] (a minimal sketch of this style of inference-time pruning appears at the end of this article)
- To train the pruner, the authors built nuScenes-FG, a large-scale dataset of 241,000 image-mask pairs spanning six camera views, which can also support future autonomous driving research [1][4]

Performance Metrics
- On the nuScenes autonomous driving benchmark, the framework achieved state-of-the-art (SOTA) results across pruning rates: at a 25% pruning rate, driving performance was essentially unchanged, with L2 trajectory error and collision rate even improving on the unpruned baseline [2][9]
- At a 50% pruning rate, the model struck a balance across all metrics while substantially improving the VLA model's inference efficiency [2][9]

Technical Innovations
- FastDriveVLA is inspired by human driving behavior, which processes information selectively: the framework discards redundant visual tokens while retaining the critical ones [3][11]
- The framework employs a foreground-background adversarial reconstruction strategy to separate essential from non-essential visual tokens (a sketch of the mask-based supervision signal appears at the end of this article) [3][11]
- ReconPruner has only 0.07 billion (70M) parameters and is adaptable to a range of VLA models [4][12]

Efficiency Improvements
- In comparisons across pruning methods, FastDriveVLA outperformed existing techniques at pruning rates of 25%, 50%, and 75%, achieving SOTA results [4][13]
- Reducing the input from 3,249 visual tokens to 812 cut FastDriveVLA's FLOPs by roughly a factor of 7.5 and markedly improved CUDA inference latency, with prefill accelerated 3.7x and decoding 1.3x [4][13][6] (a back-of-the-envelope check on these figures appears below)
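To make the plug-and-play idea concrete, here is a minimal sketch of inference-time visual token pruning. It assumes ReconPruner exposes a per-token importance score; the scoring here (random scores, the `prune_visual_tokens` helper) is an illustrative stand-in, not the paper's implementation.

```python
import torch

def prune_visual_tokens(tokens: torch.Tensor,
                        scores: torch.Tensor,
                        keep_ratio: float = 0.75) -> torch.Tensor:
    """Keep the top `keep_ratio` fraction of visual tokens by score.

    tokens: (batch, num_tokens, dim) visual token embeddings
    scores: (batch, num_tokens) per-token importance scores
    """
    num_keep = max(1, int(tokens.shape[1] * keep_ratio))
    # Take the highest-scoring tokens, then restore their original order
    # so positional structure is preserved for the LLM backbone.
    top_idx = scores.topk(num_keep, dim=1).indices.sort(dim=1).values
    return tokens.gather(1, top_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

# Example: a 25% pruning rate on 3,249 visual tokens.
tokens = torch.randn(1, 3249, 1024)
scores = torch.rand(1, 3249)  # stand-in for ReconPruner's saliency scores
pruned = prune_visual_tokens(tokens, scores, keep_ratio=0.75)
print(pruned.shape)  # torch.Size([1, 2436, 1024])
```

Because the pruner only filters tokens before the language backbone, no weights of the VLA model itself change, which is what makes the approach retraining-free.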
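The foreground-background strategy relies on nuScenes-FG-style image-mask pairs. One way to see the supervision signal: pool a pixel-level foreground mask down to the patch grid so each visual token gets a foreground/background label. The sketch below uses a plain BCE surrogate on those labels; the paper's actual adversarial reconstruction objective is richer, so treat this only as an illustration of how image-mask pairs can supervise a pruner. The patch size and all names here are assumptions.

```python
import torch
import torch.nn.functional as F

def token_foreground_labels(mask: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Pool a binary foreground mask (batch, H, W) to one label per token:
    a token counts as foreground if most of its patch is foreground."""
    pooled = F.avg_pool2d(mask.unsqueeze(1).float(), kernel_size=patch)
    return (pooled.flatten(2).squeeze(1) > 0.5).float()  # (batch, num_tokens)

def pruner_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """BCE surrogate: push token scores up for foreground, down for background."""
    return F.binary_cross_entropy_with_logits(scores, labels)

# Example: a 448x448 mask yields a 28x28 grid of 784 token labels.
mask = torch.rand(1, 448, 448) > 0.5
labels = token_foreground_labels(mask)
print(labels.shape)  # torch.Size([1, 784])
```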
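Finally, a back-of-the-envelope check on the efficiency figures, under the standard assumption that transformer compute mixes a term linear in token count (the MLP layers) and a term quadratic in it (attention); the bounds below are assumptions, not the paper's accounting.

```python
full, kept = 3249, 812
print(f"kept fraction: {kept / full:.2%}")                 # 24.99%, i.e. a 75% pruning rate
print(f"linear-term bound:    {full / kept:.1f}x")         # ~4.0x
print(f"quadratic-term bound: {(full / kept) ** 2:.1f}x")  # ~16.0x
# The reported ~7.5x FLOPs reduction falls between these two bounds,
# consistent with a cost profile mixing MLP and attention terms.
```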