Visual Token Pruning
VLA-Pruner: Temporal-Aware Visual Token Pruning for Efficient VLA Inference
具身智能之心· 2025-11-21 16:03
Author: Ziyan Liu et al. | Editor: 具身智能之心

I. Research Background and Core Challenges

1. The value and deployment bottleneck of VLA models

Vision-Language-Action (VLA) models are a core direction of embodied intelligence: they integrate visual scene perception, natural language understanding, and low-level action execution, and show strong generalization across diverse robotic tasks. However, these models must process continuous visual streams, and the number of visual tokens is typically an order of magnitude larger than the number of text tokens (e.g., 256 visual tokens vs. 30-50 text tokens), leading to heavy computational overhead that severely limits real-time deployment.

2. Core limitations of existing pruning methods

Existing visual token pruning methods (e.g., FastV, SparseVLM, DivPrune) are all designed for vision-language models (VLMs) and select tokens solely by semantic-saliency metrics computed in the prefill stage (e.g., text-visual cross-attention, feature diversity). But VLA models have a dual-system nature: high-level semantic understanding (task planning) and low-level action execution (precise manipulation) place very different demands on visual information ...
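To make the prefill-stage saliency criterion described above concrete, here is a minimal sketch of attention-based visual token pruning in the spirit of FastV-style methods. The function name, tensor shapes, and keep ratio are illustrative assumptions, not the implementation of any of the cited methods.

```python
# Minimal sketch of prefill-stage, attention-based visual token pruning.
# All names and shapes here are illustrative assumptions.
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        cross_attn: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the visual tokens that receive the most text-to-visual attention.

    visual_tokens: (B, N_v, D)   visual token embeddings (e.g., N_v = 256)
    cross_attn:    (B, N_t, N_v) attention weights from text tokens to visual tokens
    keep_ratio:    fraction of visual tokens to retain after pruning
    """
    # Saliency of each visual token = mean attention it receives from all text tokens
    saliency = cross_attn.mean(dim=1)                     # (B, N_v)
    num_keep = max(1, int(visual_tokens.size(1) * keep_ratio))
    keep_idx = saliency.topk(num_keep, dim=-1).indices    # (B, num_keep)
    keep_idx, _ = keep_idx.sort(dim=-1)                   # preserve spatial order
    batch_idx = torch.arange(visual_tokens.size(0)).unsqueeze(-1)
    return visual_tokens[batch_idx, keep_idx]             # (B, num_keep, D)

# Example: 256 visual tokens, 40 text tokens, keep half the visual tokens
v = torch.randn(2, 256, 768)
attn = torch.softmax(torch.randn(2, 40, 256), dim=-1)
print(prune_visual_tokens(v, attn, keep_ratio=0.5).shape)  # torch.Size([2, 128, 768])
```

As the article argues, a criterion of this kind captures semantic saliency only; it does not distinguish between the visual evidence needed for task planning and that needed for precise manipulation.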
Toward Mass-Production VLA! FastDriveVLA: A Plug-and-Play Pruning Module with Nearly 4x Inference Speedup
自动驾驶之心· 2025-08-23 16:03
Core Viewpoint

- The article discusses the development of FastDriveVLA, a novel visual token pruning framework designed for autonomous driving, achieving a 50% compression rate while maintaining 97.3% performance [3][13][43].

Group 1: End-to-End Autonomous Driving

- Recent advancements in end-to-end autonomous driving research have led to the adoption of vision-language-action (VLA) models, which outperform traditional modular approaches in complex scene understanding and decision-making [3][10].
- The VLA model integrates perception, action generation, and planning into a single framework, reducing information loss between modules [3][4].

Group 2: Visual Token Pruning Techniques

- Existing VLM/VLA models face high computational costs because images are encoded into large numbers of visual tokens, prompting research into visual token pruning methods [4][11].
- The two primary approaches to visual token pruning are attention-based methods and similarity-based methods, both of which have limitations in driving tasks [4][14].
- FastDriveVLA introduces a reconstruction-based visual token pruning framework that focuses on retaining tokens related to foreground areas critical for driving decisions [5][13].

Group 3: FastDriveVLA Framework

- FastDriveVLA employs a plug-and-play pruner called ReconPruner, trained with a pixel-reconstruction task to emphasize foreground information [6][17].
- The framework includes an adversarial foreground-background reconstruction strategy to strengthen the model's ability to distinguish foreground tokens from background tokens [20][21].
- A large-scale dataset, nuScenes-FG, containing 241,000 image-mask pairs, was constructed to train ReconPruner for effective foreground segmentation [6][12][13].

Group 4: Experimental Results

- FastDriveVLA achieved state-of-the-art results on the nuScenes closed-loop planning benchmark, demonstrating its effectiveness and practicality [13][28].
- The framework was evaluated at several pruning ratios (25%, 50%, 75%) and consistently outperformed existing methods on key metrics such as L2 error and collision rate [30][34].
- The efficiency analysis showed that FastDriveVLA significantly reduces FLOPs and CUDA latency compared with other methods, improving real-time deployment capability [36][40].

Group 5: Contributions and Implications

- FastDriveVLA provides a new paradigm for efficient inference in VLA models and offers insights into task-specific token pruning strategies [43].
- The research highlights the importance of focusing on foreground information in autonomous driving tasks, which can improve performance while reducing computational cost [5][43].
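As a rough illustration of how a plug-and-play pruner such as ReconPruner could sit in front of the VLA decoder, below is a hedged sketch of score-based foreground token pruning at a configurable ratio. The scorer architecture, names, and shapes are assumptions for illustration; FastDriveVLA's actual pruner is trained with a pixel-reconstruction objective and an adversarial foreground-background strategy that are not reproduced here.

```python
# Hedged sketch: keep foreground-relevant visual tokens at a configurable
# pruning ratio. TokenScorer and its MLP are assumed stand-ins, not the
# published ReconPruner architecture or training recipe.
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Assigns a foreground-relevance score to each visual token."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                 nn.Linear(dim // 4, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) -> scores: (B, N)
        return self.mlp(tokens).squeeze(-1)

def prune_by_score(tokens: torch.Tensor, scorer: TokenScorer,
                   prune_ratio: float = 0.5) -> torch.Tensor:
    """Drop the lowest-scoring tokens; prune_ratio=0.5 keeps half of them."""
    scores = scorer(tokens)                                          # (B, N)
    num_keep = max(1, int(tokens.size(1) * (1.0 - prune_ratio)))
    keep_idx = scores.topk(num_keep, dim=-1).indices.sort(dim=-1).values
    batch_idx = torch.arange(tokens.size(0)).unsqueeze(-1)
    return tokens[batch_idx, keep_idx]                               # (B, num_keep, D)

# Example: prune 50% of 256 visual tokens before feeding the VLA decoder
scorer = TokenScorer(dim=768)
tokens = torch.randn(1, 256, 768)
print(prune_by_score(tokens, scorer, prune_ratio=0.5).shape)  # torch.Size([1, 128, 768])
```

In this sketch the pruner is purely a front-end module on the visual token sequence, which is what makes it plug-and-play: the downstream VLA decoder is unchanged and simply receives a shorter sequence, which is where the FLOPs and latency savings reported in the article come from.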