Visual Token Pruning
XPeng and Peking University Propose a New Visual Token Pruning Framework; He Xiaopeng: Another Breakthrough on the Road to L4
Xin Lang Cai Jing · 2025-12-28 07:56
Core Insights
- The paper "FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning" has been accepted at AAAI 2026, presenting a novel visual token pruning framework designed specifically for end-to-end autonomous driving VLA models [1][8]
- FastDriveVLA introduces a plug-and-play visual token pruner, ReconPruner, that can be integrated into an autonomous driving VLA model at inference time without retraining the full model [1][8]
- A large-scale dataset, nuScenes-FG, consisting of 241,000 image-mask pairs across six camera perspectives, was built to train the pruning mechanism and can support future autonomous driving research [1][4]

Performance Metrics
- On the nuScenes autonomous driving dataset, the framework achieved state-of-the-art (SOTA) results across pruning rates: at a 25% pruning rate, driving performance was nearly unchanged, with L2 trajectory error and collision rate even better than the unpruned baseline [2][9]
- At a 50% pruning rate, the model balanced performance across all metrics while markedly improving VLA inference efficiency [2][9]

Technical Innovations
- FastDriveVLA is inspired by human driving behavior: drivers process information selectively, so the framework discards redundant visual tokens while retaining critical ones [3][11]
- A foreground-background adversarial reconstruction strategy separates essential from non-essential visual tokens, optimizing model performance [3][11]
- ReconPruner contains only 0.07 billion parameters and adapts to various VLA models, underscoring its efficiency [4][12]

Efficiency Improvements
- Compared against existing pruning methods at rates of 25%, 50%, and 75%, FastDriveVLA achieved SOTA results at every rate [4][13]
- Reducing the input token count from 3,249 to 812 cut FLOPs by roughly 7.5x and lowered CUDA inference latency, accelerating prefill by 3.7x and decoding by 1.3x [4][13][6]
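The plug-and-play pruning step described above amounts to scoring each visual token and keeping only the highest-scoring ones before the language model sees them. A minimal NumPy sketch of that keep-top-k operation, where `scores` is a stand-in for the importance output of a learned pruner such as ReconPruner (the learned scorer itself is not reproduced here):

```python
import numpy as np

def prune_visual_tokens(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of visual tokens, preserving their order.

    tokens: (N, D) array of token embeddings
    scores: (N,) per-token importance (stand-in for a learned pruner's output)
    """
    n_keep = max(1, int(round(len(scores) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-n_keep:])  # top-n_keep indices, in order
    return tokens[keep], keep

# A 75% pruning rate keeps 25% of tokens: 3,249 -> 812, as in the article
tokens = np.random.randn(3249, 64)
scores = np.random.randn(3249)
pruned, kept = prune_visual_tokens(tokens, scores, keep_ratio=0.25)
print(pruned.shape)  # (812, 64)
```

Because pruning happens before the VLA's language backbone, every downstream attention layer operates on the shorter sequence, which is what makes the approach model-agnostic.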
VLA-Pruner: Temporal-Aware Visual Token Pruning for Efficient VLA Inference
具身智能之心 · 2025-11-21 16:03
Group 1
- The core challenge of VLA models is integrating visual scene perception, natural language understanding, and action execution; the far larger number of visual tokens relative to text tokens creates significant computational overhead [2][4]
- Existing visual token pruning methods focus mainly on semantic relevance, neglecting the distinct needs of high-level semantic understanding and low-level action execution, which causes performance drops at high pruning rates [3][4]
- A key observation is that the temporal continuity of robot operations allows the visual tokens needed for the current action to be estimated from historical attention trends, offering a way past the limitations of existing methods [5]

Group 2
- VLA-Pruner retains both semantic-understanding and action-execution tokens under a given computational budget, achieving efficient inference without performance loss through a dual-level criterion and selection strategy [6][10]
- The dual-level importance criteria combine semantic relevance, based on prefill attention scores, with action-level importance estimated through temporal smoothing, giving a comprehensive basis for token selection [7][9]
- A "merge-filter" mechanism maximizes relevance and minimizes redundancy, ensuring that all tokens critical for either semantic understanding or action execution are preserved [10][11]

Group 3
- At a 50% pruning rate, VLA-Pruner not only maintains performance but improves success rates, with OpenVLA gaining an average of 2.45% [16]
- VLA-Pruner is robust across scenarios, reaching a 96.8% success rate in the SIMPLER environment at a 75% pruning rate and significantly outperforming baseline methods [19][20]
- Efficiency gains are notable: at a 50% pruning rate, FLOPs drop to roughly 60% of the original model and inference runs up to 1.8x faster [26][27]

Group 4
- Core contributions include a dual-level pruning criterion that addresses the inherent flaws of existing methods and a plug-and-play pruning framework that improves inference efficiency without altering the model architecture [31]
- Limitations include potentially inaccurate action-attention estimates in dynamic scenarios with rapid perspective shifts or target changes, an area flagged for future optimization [31]
- Future directions include adaptive prediction modules and complementary techniques such as quantization and layer pruning to further improve deployment efficiency [31]
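The temporal-smoothing idea in Group 2 can be sketched in a few lines: action-level token importance at the current step is estimated as an exponential moving average (EMA) of recent attention scores, then merged with semantic (prefill-attention) scores under a token budget. This is an illustrative reconstruction, not the paper's exact algorithm; the EMA weight `alpha` and the elementwise-max merge rule are assumptions:

```python
import numpy as np

def ema(prev, curr, alpha=0.7):
    """Temporally smoothed action-attention estimate (alpha is an assumed weight)."""
    return alpha * prev + (1 - alpha) * curr

def dual_level_select(sem_scores, act_scores, budget):
    """Merge step: a token survives if it matters for semantics OR action.

    Both score vectors are normalized, combined by elementwise max, and the
    top-`budget` tokens are kept in their original order.
    """
    sem = sem_scores / (sem_scores.sum() + 1e-8)
    act = act_scores / (act_scores.sum() + 1e-8)
    combined = np.maximum(sem, act)
    return np.sort(np.argsort(combined)[-budget:])

# Toy usage: smooth attention over two steps, then select under a 50% budget
prev_attn = np.abs(np.random.randn(16))
curr_attn = np.abs(np.random.randn(16))
act_est = ema(prev_attn, curr_attn)
kept = dual_level_select(np.abs(np.random.randn(16)), act_est, budget=8)
print(len(kept))  # 8
```

The EMA is what exploits temporal continuity: when consecutive frames are similar, the smoothed estimate predicts which tokens the next action step will attend to without recomputing full attention.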
Toward Production VLA! FastDriveVLA: A Plug-and-Play Pruning Module with Nearly 4x Faster Inference
自动驾驶之心 · 2025-08-23 16:03
Core Viewpoint
- The article presents FastDriveVLA, a novel visual token pruning framework for autonomous driving that achieves a 50% compression rate while retaining 97.3% of performance [3][13][43]

Group 1: End-to-End Autonomous Driving
- Recent end-to-end autonomous driving research has adopted vision-language-action (VLA) models, which outperform traditional modular approaches in complex scene understanding and decision-making [3][10]
- The VLA model integrates perception, action generation, and planning into a single framework, reducing information loss between modules [3][4]

Group 2: Visual Token Pruning Techniques
- Existing VLM/VLA models incur high computational costs because images are encoded into large numbers of visual tokens, motivating research on visual token pruning [4][11]
- The two main existing approaches, attention-based and similarity-based pruning, both have limitations in driving tasks [4][14]
- FastDriveVLA instead introduces reconstruction-based pruning, retaining tokens tied to the foreground regions critical for driving decisions [5][13]

Group 3: FastDriveVLA Framework
- FastDriveVLA employs a plug-and-play pruner, ReconPruner, trained with a pixel reconstruction task that emphasizes foreground information [6][17]
- An adversarial foreground-background reconstruction strategy sharpens the model's ability to distinguish foreground from background tokens [20][21]
- The large-scale nuScenes-FG dataset, with 241,000 image-mask pairs, was constructed to train ReconPruner for effective foreground segmentation [6][12][13]

Group 4: Experimental Results
- FastDriveVLA achieved state-of-the-art results on the nuScenes closed-loop planning benchmark, demonstrating its effectiveness and practicality [13][28]
- Across pruning ratios of 25%, 50%, and 75%, it consistently outperformed existing methods on key metrics such as L2 error and collision rate [30][34]
- Efficiency analysis shows FastDriveVLA significantly reduces FLOPs and CUDA latency compared with other methods, aiding real-time deployment [36][40]

Group 5: Contributions and Implications
- FastDriveVLA offers a new paradigm for efficient inference in VLA models and insight into task-specific token pruning strategies [43]
- The work highlights that focusing on foreground information in driving tasks can improve performance while cutting computational cost [5][43]
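For the efficiency claims above, a back-of-envelope transformer cost model shows why cutting visual tokens pays off: attention-score computation scales quadratically with sequence length, while the dominant projection and MLP terms scale linearly. The constants below (hidden size, layer count, 4x MLP expansion) are illustrative assumptions, not the evaluated model's actual configuration, so the ratio is only indicative; the reported ~7.5x FLOPs reduction reflects the specific model and workload:

```python
def prefill_flops(n_tokens, d_model=4096, n_layers=32):
    """Rough prefill FLOPs for a decoder-only transformer (assumed constants).

    Per-layer breakdown: QKV/output projections (~4 * n * d^2), attention
    scores (2 * n^2 * d), and a 4x-expansion MLP (16 * n * d^2).
    """
    attn_proj = 4 * n_tokens * d_model**2
    attn_scores = 2 * n_tokens**2 * d_model
    mlp = 16 * n_tokens * d_model**2
    return n_layers * (attn_proj + attn_scores + mlp)

# Token counts from the article: 3,249 full vs. 812 after 75% pruning
ratio = prefill_flops(3249) / prefill_flops(812)
print(f"{ratio:.1f}x")  # ~4.2x under these assumed constants
```

Under this simple model the linear terms dominate, so the saving tracks the token reduction (~4x); deeper stacks, longer prompts, or different accounting push the measured figure higher, which is consistent with the prefill speedup of 3.7x reported for the same setting.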