AAAI 2026 | XPeng and Peking University Present a Visual Token Pruning Method Tailored for VLA Models

Core Viewpoint
- The article presents FastDriveVLA, a new framework for efficient visual token pruning in end-to-end autonomous driving systems that significantly reduces computational cost and improves inference efficiency [1][8].

Group 1: Research Background and Problem
- End-to-end autonomous driving shows great potential to transform future transportation systems: it learns the entire driving process within a unified framework, reducing the errors that accumulate when information is handed off between separate modules [7].
- Existing VLA models convert visual inputs into a large number of visual tokens, which incurs significant computational overhead and inference latency and complicates real-world deployment [7][8].
- Prior work on reducing visual tokens transfers poorly to autonomous driving: new architectural designs typically require retraining the entire model, and pruning strategies based on attention or similarity may retain tokens that are irrelevant to the driving task [7][8].

Group 2: Methodology and Innovations
- FastDriveVLA introduces a novel reconstruction-based visual token pruning framework tailored specifically for end-to-end autonomous driving [8].
- The research team hypothesized that visual tokens carrying foreground information are more valuable than those covering background content, and built the nuScenes-FG dataset of 241,000 images with foreground annotations to support this [2][13].
- ReconPruner, a lightweight plug-and-play pruner, is trained with a masked image modeling objective for pixel reconstruction so that it learns to identify and keep meaningful foreground visual tokens (a minimal code sketch of this score-and-keep step follows this summary) [16][19].

Group 3: Experimental Results
- FastDriveVLA achieved state-of-the-art (SOTA) results on open-loop planning benchmarks on the nuScenes dataset while delivering significant efficiency gains [2][20].
- Reducing the number of visual tokens from 3,249 to 812 cut FLOPs by roughly 7.5x, prefill time by 3.7x, and decode time by 1.3x [26][27].
- The framework outperformed existing methods across pruning ratios; at a 50% pruning rate it maintained balanced performance across all metrics [25][28].

Group 4: Efficiency Analysis
- Efficiency was analyzed in terms of FLOPs and CUDA latency, showing a large reduction in compute while maintaining high performance (a rough scaling check on these numbers appears after the code sketch below) [26][27].
- At a 25% pruning rate, FastDriveVLA delivered the best results across all evaluation metrics, indicating that focusing on foreground-related visual tokens is crucial for autonomous driving performance [28].
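To make the ReconPruner step concrete, here is a minimal sketch of score-and-keep token pruning, assuming a simple MLP scoring head and top-k selection. The class name `ReconPrunerSketch`, the layer sizes, and the `keep_ratio` parameter are illustrative assumptions; the real ReconPruner architecture and its masked-image-modeling reconstruction training follow the paper, not this sketch.

```python
import torch
import torch.nn as nn


class ReconPrunerSketch(nn.Module):
    """Minimal sketch of a foreground token scorer with top-k pruning.

    Assumptions for illustration only: the scoring head, layer sizes,
    and keep_ratio interface are hypothetical, not the paper's design.
    """

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Small MLP head mapping each visual token to a scalar
        # "foreground relevance" score.
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, tokens: torch.Tensor, keep_ratio: float = 0.25):
        # tokens: (batch, num_tokens, dim) visual tokens from the vision encoder.
        scores = self.scorer(tokens).squeeze(-1)       # (batch, num_tokens)
        k = max(1, int(tokens.shape[1] * keep_ratio))  # token budget after pruning
        # Keep the k highest-scoring tokens, re-sorted into their
        # original order so positional structure is preserved.
        keep_idx = scores.topk(k, dim=1).indices.sort(dim=1).values
        batch_idx = torch.arange(tokens.shape[0], device=tokens.device).unsqueeze(-1)
        return tokens[batch_idx, keep_idx], keep_idx


# With 3,249 input tokens and keep_ratio = 0.25, roughly 812 tokens
# are forwarded to the LLM, matching the token budget in the article.
pruner = ReconPrunerSketch(dim=1024)
visual_tokens = torch.randn(1, 3249, 1024)
kept, idx = pruner(visual_tokens, keep_ratio=0.25)
print(kept.shape)  # torch.Size([1, 812, 1024])
```

Re-sorting the kept indices preserves the surviving tokens' original order, which keeps positional information consistent for the downstream language model; the pruner itself stays plug-and-play because it only filters the token sequence and leaves the VLA backbone untouched.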
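As a rough, assumption-based sanity check on the reported ~7.5x FLOPs reduction (an estimate, not a calculation from the article): if prefill compute is modeled as a linear projection/MLP term plus a quadratic self-attention term in the token count, cutting tokens by about 4x bounds the reduction between 4x and 16x.

```latex
% Assumption: prefill FLOPs = linear (projection/MLP) term + quadratic (attention) term.
\[
  \frac{\mathrm{FLOPs}(N)}{\mathrm{FLOPs}(N/r)}
  = \frac{aN + bN^2}{a(N/r) + b(N/r)^2},
  \qquad r = \frac{3249}{812} \approx 4
\]
% The ratio lies between r (all-linear, ~4x) and r^2 (all-quadratic, ~16x),
% so the reported ~7.5x reduction falls inside this range.
```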
