Visual Token Pruning
VLA-Pruner: Temporal-Aware Visual Token Pruning for Efficient VLA Inference
具身智能之心· 2025-11-21 16:03
Author: Ziyan Liu et al. | Editor: 具身智能之心

I. Research Background and Core Challenges

1. The value and deployment bottleneck of VLA models

Vision-Language-Action (VLA) models are a core direction of embodied intelligence: they integrate visual scene perception, natural language understanding, and low-level action execution, and show strong generalization across diverse robotic tasks. However, these models must process continuous visual streams, and the number of visual tokens is typically an order of magnitude larger than the number of text tokens (e.g., 256 visual tokens vs. 30-50 text tokens), leading to heavy computational overhead that severely limits real-time deployment.

2. Core limitations of existing pruning methods

Existing visual token pruning methods (e.g., FastV, SparseVLM, DivPrune) are all designed for vision-language models (VLMs) and select tokens solely by semantic-saliency metrics computed in the prefill stage (e.g., text-visual cross-attention, feature diversity). But VLA models have a dual-system nature: high-level semantic understanding (task planning) and low-level action execution (precise manipulation) place very different demands on visual information ...
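To make the prefill-stage saliency criterion described above concrete, here is a minimal sketch of attention-based visual token pruning in the spirit of FastV-style methods. The function name, tensor shapes, and keep ratio are illustrative assumptions, not the implementation of any of the cited methods.

```python
# Minimal sketch of prefill-stage, attention-based visual token pruning.
# All names and shapes here are illustrative assumptions.
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        cross_attn: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the visual tokens that receive the most text-to-visual attention.

    visual_tokens: (B, N_v, D)   visual token embeddings (e.g., N_v = 256)
    cross_attn:    (B, N_t, N_v) attention weights from text tokens to visual tokens
    keep_ratio:    fraction of visual tokens to retain after pruning
    """
    # Saliency of each visual token = mean attention it receives from all text tokens
    saliency = cross_attn.mean(dim=1)                     # (B, N_v)
    num_keep = max(1, int(visual_tokens.size(1) * keep_ratio))
    keep_idx = saliency.topk(num_keep, dim=-1).indices    # (B, num_keep)
    keep_idx, _ = keep_idx.sort(dim=-1)                   # preserve spatial order
    batch_idx = torch.arange(visual_tokens.size(0)).unsqueeze(-1)
    return visual_tokens[batch_idx, keep_idx]             # (B, num_keep, D)

# Example: 256 visual tokens, 40 text tokens, keep half the visual tokens
v = torch.randn(2, 256, 768)
attn = torch.softmax(torch.randn(2, 40, 256), dim=-1)
print(prune_visual_tokens(v, attn, keep_ratio=0.5).shape)  # torch.Size([2, 128, 768])
```

As the article argues, a criterion of this kind captures semantic saliency only; it does not distinguish between the visual evidence needed for task planning and that needed for precise manipulation.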
Toward Mass-Production VLA! FastDriveVLA: A Plug-and-Play Pruning Module with Nearly 4x Inference Speedup
自动驾驶之心· 2025-08-23 16:03
Core Viewpoint

- The article discusses the development of FastDriveVLA, a novel visual token pruning framework designed for autonomous driving, achieving a 50% compression rate while maintaining 97.3% performance [3][13][43].

Group 1: End-to-End Autonomous Driving

- Recent advancements in end-to-end autonomous driving research have led to the adoption of vision-language-action (VLA) models, which outperform traditional modular approaches in complex scene understanding and decision-making [3][10].
- The VLA model integrates perception, action generation, and planning into a single framework, reducing information loss between modules [3][4].

Group 2: Visual Token Pruning Techniques

- Existing VLM/VLA models face high computational costs because images are encoded into large numbers of visual tokens, prompting research into visual token pruning methods [4][11].
- The two primary approaches to visual token pruning are attention-based methods and similarity-based methods, both of which have limitations in driving tasks [4][14].
- FastDriveVLA introduces a reconstruction-based visual token pruning framework that focuses on retaining tokens related to foreground areas critical for driving decisions [5][13].

Group 3: FastDriveVLA Framework

- FastDriveVLA employs a plug-and-play pruner called ReconPruner, trained with a pixel-reconstruction task to emphasize foreground information [6][17].
- The framework includes an adversarial foreground-background reconstruction strategy to strengthen the model's ability to distinguish foreground tokens from background tokens [20][21].
- A large-scale dataset, nuScenes-FG, containing 241,000 image-mask pairs, was constructed to train ReconPruner for effective foreground segmentation [6][12][13].

Group 4: Experimental Results

- FastDriveVLA achieved state-of-the-art results on the nuScenes closed-loop planning benchmark, demonstrating its effectiveness and practicality [13][28].
- The framework was evaluated at several pruning ratios (25%, 50%, 75%) and consistently outperformed existing methods on key metrics such as L2 error and collision rate [30][34].
- The efficiency analysis showed that FastDriveVLA significantly reduces FLOPs and CUDA latency compared with other methods, improving real-time deployment capability [36][40].

Group 5: Contributions and Implications

- FastDriveVLA provides a new paradigm for efficient inference in VLA models and offers insights into task-specific token pruning strategies [43].
- The research highlights the importance of focusing on foreground information in autonomous driving tasks, which can improve performance while reducing computational cost [5][43].
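As a rough illustration of how a plug-and-play pruner such as ReconPruner could sit in front of the VLA decoder, below is a hedged sketch of score-based foreground token pruning at a configurable ratio. The scorer architecture, names, and shapes are assumptions for illustration; FastDriveVLA's actual pruner is trained with a pixel-reconstruction objective and an adversarial foreground-background strategy that are not reproduced here.

```python
# Hedged sketch: keep foreground-relevant visual tokens at a configurable
# pruning ratio. TokenScorer and its MLP are assumed stand-ins, not the
# published ReconPruner architecture or training recipe.
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Assigns a foreground-relevance score to each visual token."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                 nn.Linear(dim // 4, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) -> scores: (B, N)
        return self.mlp(tokens).squeeze(-1)

def prune_by_score(tokens: torch.Tensor, scorer: TokenScorer,
                   prune_ratio: float = 0.5) -> torch.Tensor:
    """Drop the lowest-scoring tokens; prune_ratio=0.5 keeps half of them."""
    scores = scorer(tokens)                                          # (B, N)
    num_keep = max(1, int(tokens.size(1) * (1.0 - prune_ratio)))
    keep_idx = scores.topk(num_keep, dim=-1).indices.sort(dim=-1).values
    batch_idx = torch.arange(tokens.size(0)).unsqueeze(-1)
    return tokens[batch_idx, keep_idx]                               # (B, num_keep, D)

# Example: prune 50% of 256 visual tokens before feeding the VLA decoder
scorer = TokenScorer(dim=768)
tokens = torch.randn(1, 256, 768)
print(prune_by_score(tokens, scorer, prune_ratio=0.5).shape)  # torch.Size([1, 128, 768])
```

In this sketch the pruner is purely a front-end module on the visual token sequence, which is what makes it plug-and-play: the downstream VLA decoder is unchanged and simply receives a shorter sequence, which is where the FLOPs and latency savings reported in the article come from.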