NeurIPS'25! AutoPrune: A Plug-and-Play, Complexity-Adaptive Pruning Framework for Large Models
自动驾驶之心 · 2025-10-07 07:46
Core Insights
- The article introduces AutoPrune, a training-free, complexity-adaptive pruning framework that alleviates the computational burden of visual language models (VLMs) by quantifying task complexity through the mutual information between visual and textual tokens [3][18].

Background Review
- Visual language models are central to multimodal systems, supporting tasks such as image captioning and visual question answering (VQA). In perception-control-coupled frameworks such as VLA for autonomous driving, high-resolution images are converted into large numbers of visual tokens, creating significant memory and latency bottlenecks [4][18].
- Previous methods typically rely on fixed layer-wise pruning schedules that lack a global budget constraint and require manual tuning, making them poorly adapted to varying task complexity [4][11].

Key Contributions
- AutoPrune formulates visual token pruning as a constrained optimization problem under a global computational budget, jointly optimizing three strategies: layer-wise token allocation, token selection, and token recovery (an allocation/selection sketch follows this summary) [9][10].
- The complexity metric is derived from cross-modal attention, directly computing the mutual information between visual and textual tokens to characterize sample difficulty and task complexity (a mutual-information sketch follows this summary) [10][13].
- The framework is plug-and-play, requires no training, and outperforms existing training-free methods across a range of datasets and pruning ratios [10][11].

Experimental Results
- With LLaVA-1.5-7B, retaining 64 visual tokens preserves 96.7% of the original accuracy while reducing FLOPs to 23.2%, indicating minimal loss under moderate pruning [14].
- LLaVA-NeXT-7B outperforms comparative methods across different token-retention budgets, keeping 94.9% of original performance at a budget of 160 tokens [15].
- The results indicate that AutoPrune can support real-time multimodal inference and embodied intelligence, and they reveal nuanced differences in attention distribution that align with observations from cognitive neuroscience [18].
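The summary above describes the complexity metric only at a high level. As a minimal, hypothetical sketch, the snippet below shows one way a mutual-information score could be computed from a text-to-visual cross-modal attention map by treating the normalized attention weights as a joint distribution. The function name `mi_complexity_score`, the tensor shapes, and the normalization choices are illustrative assumptions, not AutoPrune's actual formulation.

```python
import torch

def mi_complexity_score(attn: torch.Tensor) -> torch.Tensor:
    """Mutual-information proxy for sample complexity.

    attn: [T, V] non-negative cross-modal attention weights from T text
    tokens (queries) to V visual tokens (keys). Treating the normalized
    map as a joint distribution p(t, v) is an illustrative assumption.
    """
    eps = 1e-12
    p_joint = attn / attn.sum().clamp_min(eps)          # joint p(t, v)
    p_t = p_joint.sum(dim=1, keepdim=True)              # marginal p(t), shape [T, 1]
    p_v = p_joint.sum(dim=0, keepdim=True)              # marginal p(v), shape [1, V]
    # MI = sum_{t,v} p(t, v) * log( p(t, v) / (p(t) * p(v)) )
    ratio = p_joint / (p_t * p_v).clamp_min(eps)
    return (p_joint * ratio.clamp_min(eps).log()).sum()

# Example: 32 text tokens attending over 576 visual tokens (e.g., a 24x24 patch grid).
attn = torch.rand(32, 576).softmax(dim=-1)
complexity = mi_complexity_score(attn)
```

A higher score indicates stronger, more structured text-image coupling, which the summary describes as the signal used to characterize sample difficulty.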
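The constrained-optimization formulation is likewise summarized without its exact objective. The sketch below illustrates, under stated assumptions, how a global token budget could be split across layers as a function of the complexity score and how tokens could then be selected per layer by attention mass. The geometric decay schedule, the complexity scaling, and the helpers `allocate_layer_budgets` and `select_tokens` are hypothetical; token recovery is not shown.

```python
import torch

def allocate_layer_budgets(total_budget: int, num_layers: int,
                           complexity: float, base_decay: float = 0.85) -> list[int]:
    """Split a global budget (total retained visual tokens, summed over layers)
    across layers. Higher complexity flattens the schedule so deeper layers keep
    more tokens; the geometric decay is an illustrative choice."""
    c = min(max(complexity, 0.0), 1.0)
    decay = base_decay + (1.0 - base_decay) * c          # complexity-adaptive decay
    weights = torch.tensor([decay ** layer for layer in range(num_layers)])
    weights = weights / weights.sum()
    return (weights * total_budget).round().int().clamp_min(1).tolist()

def select_tokens(visual_tokens: torch.Tensor, attn_scores: torch.Tensor, k: int):
    """Keep the k visual tokens with the highest text-to-visual attention mass,
    preserving their original positional order."""
    keep = attn_scores.topk(k).indices.sort().values
    return visual_tokens[keep], keep

# Example: a 32-layer VLM, 576 visual tokens, and a normalized complexity of 0.4.
budgets = allocate_layer_budgets(total_budget=32 * 64, num_layers=32, complexity=0.4)
tokens = torch.randn(576, 4096)                          # [V, hidden_dim]
scores = torch.rand(576)                                 # per-token attention mass
pruned, kept_idx = select_tokens(tokens, scores, k=budgets[0])
```

The point of the sketch is the division of labor described in the summary: a single global budget constrains the whole forward pass, while per-layer allocation and per-token selection adapt to how difficult the sample appears to be.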