AAAI 2026 | XPENG and Peking University tailor a visual token pruning method to VLA models, making end-to-end autonomous driving more efficient
XPENG (US:XPEV) 机器之心 · 2026-01-04 05:43

Core Insights
- Vision-Language-Action (VLA) models are increasingly used in end-to-end autonomous driving systems, but their long visual token sequences significantly raise computational cost [2][8]
- The paper "FastDriveVLA," co-authored by XPENG and Peking University, introduces a new paradigm for efficient visual token pruning in autonomous driving VLA models [2][5]
- The research argues that visual tokens carrying foreground information are more valuable than those carrying background content. To support this, the authors built nuScenes-FG, a large-scale dataset of 241,000 images with foreground-region annotations [2][13]

Summary by Sections

Research Background and Issues
- End-to-end autonomous driving learns the entire driving process within a unified framework and shows great potential to transform future transportation systems [6]
- Existing VLA models convert visual inputs into large numbers of visual tokens, which adds significant computational overhead and inference latency and makes real-world deployment difficult [8]

Methodology and Innovations
- FastDriveVLA is a novel, reconstruction-based visual token pruning framework tailored to end-to-end autonomous driving VLA models [10]
- The framework includes ReconPruner, a lightweight, plug-and-play pruner that identifies and selects meaningful foreground visual tokens using a masked image modeling approach [16][18]
- An adversarial foreground-background reconstruction strategy strengthens ReconPruner's ability to distinguish foreground tokens from background tokens [19]

Experimental Results
- FastDriveVLA achieves state-of-the-art performance across the tested pruning ratios on the nuScenes open-loop planning benchmark [20][25]
- When the number of visual tokens is reduced from 3,249 to 812 (about a quarter of the original count), FLOPs drop by a factor of approximately 7.5 and CUDA inference latency falls significantly [26]
- The framework outperforms existing methods, particularly at a 50% pruning ratio, where it achieves balanced performance across all metrics [25]

Efficiency Analysis
- The substantial reductions in FLOPs and CUDA latency indicate FastDriveVLA's potential for real-time autonomous driving applications [26][27]
- At a 25% pruning ratio, FastDriveVLA shows the best performance across all evaluation metrics, indicating that focusing on foreground-related visual tokens is crucial to autonomous driving performance [28]
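
To make the token counts above concrete, here is a minimal sketch of score-based token pruning in plain Python. It is not the paper's implementation: the summary does not describe how ReconPruner's outputs are turned into a selection, so keeping the top-scoring tokens is an assumption, and `keep_count`, `prune_tokens`, and the `scores` input are hypothetical names introduced only for illustration.

```python
from typing import List, Sequence


def keep_count(num_tokens: int, pruning_ratio: float) -> int:
    """Number of tokens retained after pruning the given fraction (hypothetical helper)."""
    if not 0.0 <= pruning_ratio < 1.0:
        raise ValueError("pruning_ratio must be in [0, 1)")
    # Round down, but always keep at least one token.
    return max(1, int(num_tokens * (1.0 - pruning_ratio)))


def prune_tokens(tokens: Sequence, scores: Sequence[float], pruning_ratio: float) -> List:
    """Keep the highest-scoring tokens, preserving their original order.

    `scores` stands in for a per-token saliency value (assumed to be higher
    for foreground tokens); how such scores are produced is outside this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = keep_count(len(tokens), pruning_ratio)
    # Indices of the k largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore positional order so downstream layers see tokens in sequence.
    return [tokens[i] for i in sorted(top)]


if __name__ == "__main__":
    # 3,249 tokens at a 75% pruning ratio leaves 812, matching the counts above.
    print(keep_count(3249, 0.75))
    print(prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], 0.5))
```

Under these assumptions, the 3,249 to 812 reduction corresponds to pruning roughly 75% of the visual tokens. The sketch restores positional order after selection because sequence models typically depend on token position.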
