V²Drop
CVPR 2026 | Lossless 1.87× VLM acceleration from the intrinsic variation of visual tokens
机器之心 · 2026-03-15 06:00
Background and Motivation

- The demand for high-resolution image understanding and long-video processing has surged, sharply increasing the number of visual tokens that large vision-language models (LVLMs) must process. Inference efficiency has therefore become a core bottleneck for deployment [5]
- Existing token compression methods rely on attention weights to rank token importance, which has two critical flaws: position bias and incompatibility with efficient attention operators [5][7]

Core Findings

- Finding 1: Attention-based scoring exhibits a systematic end-position bias: tokens at the end of the sequence are retained at rates of 80%-100% while tokens at the front are retained at only 10%-30%, with no correlation to content importance. In contrast, L2-norm scores are distributed nearly uniformly across positions, avoiding position bias [8][10]
- Finding 2: Tokens whose hidden states change strongly between layers correspond to semantically important regions. Several variation metrics (L1 norm, L2 norm, and cosine similarity between a token's hidden states at adjacent layers) all peak in task-relevant areas, indicating that inter-layer variation is a robust intrinsic signal of visual-token importance, with the L2 norm proving the best metric [12][14]

Solution: V²Drop

V²Drop applies a multi-stage progressive pruning strategy during LLM inference to achieve efficient, position-unbiased token compression. The process comprises three steps:

1. Variation computation: for each visual token, compute the L2 distance between its hidden states at the current and previous layers as an importance score, adding negligible overhead [15]
2. Ranking and selection: sort tokens by variation score and retain the top K, filtering out inactive tokens without introducing position bias [16]
3. Progressive dropping: prune at shallow, middle, and deep layers rather than all at once, which outperforms one-shot pruning [18]

Experimental Results

- Image understanding: on LLaVA-1.5-7B, V²Drop compresses 66.7% of the visual tokens (retaining 192) while preserving 97.6% of the original performance, surpassing the next-best method, PDrop, at 96.0% [23]
- Video understanding: V²Drop retains only 25% of the tokens while preserving 98.6% performance, outperforming DyCoke at 97.7% [25]
- Efficiency: V²Drop reduces LLM generation latency by 31.5% on image tasks and 74.2% on video tasks, while increasing throughput and lowering peak memory usage [27][28]

Conclusion

- By exploiting the link between a token's inter-layer variation and its task relevance, V²Drop opens a new path for accelerating inference in vision-language models. The framework is lightweight, progressive, and fully compatible with efficient attention operators, achieving state-of-the-art results on both image and video understanding tasks [31]
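The variation metrics from Finding 2 are simple per-token statistics over consecutive layers' hidden states. A minimal sketch, assuming a `(num_tokens, hidden_dim)` layout and the function name `variation_scores` (both illustrative, not from the paper):

```python
import torch

def variation_scores(h_prev: torch.Tensor, h_curr: torch.Tensor) -> dict:
    """Per-token variation between consecutive layers' hidden states.

    h_prev, h_curr: (num_tokens, hidden_dim) hidden states of the same
    visual tokens at layer l-1 and layer l (shape is an assumption
    made for illustration).
    """
    delta = h_curr - h_prev
    l1 = delta.abs().sum(dim=-1)        # L1 norm of the change
    l2 = delta.norm(p=2, dim=-1)        # L2 norm: the paper's preferred metric
    cos = torch.nn.functional.cosine_similarity(h_prev, h_curr, dim=-1)
    return {"l1": l1, "l2": l2, "cosine": cos}
```

Per Finding 2, all three curves peak on semantically important tokens; the L2 variant is the one V²Drop uses as its importance score.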
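The three-step pipeline (score by inter-layer L2 variation, keep the top K, repeat at several depths) can be sketched as follows; the function name, single-image batch-free layout, and the example layer/ratio schedule are all assumptions for illustration, not the paper's actual configuration:

```python
import torch

def drop_low_variation_tokens(hidden: torch.Tensor,
                              hidden_prev: torch.Tensor,
                              keep_ratio: float):
    """Keep the top-K visual tokens ranked by inter-layer L2 variation.

    hidden, hidden_prev: (num_visual_tokens, dim) hidden states at the
    current and previous layer. keep_ratio in (0, 1]. Returns the kept
    states and their original indices.
    """
    # Step 1: variation score = L2 distance to the previous layer's state.
    scores = (hidden - hidden_prev).norm(p=2, dim=-1)
    # Step 2: rank by score and retain the top K tokens.
    k = max(1, int(keep_ratio * hidden.shape[0]))
    kept = torch.topk(scores, k).indices.sort().values  # preserve token order
    return hidden[kept], kept

# Step 3 (progressive dropping): apply the pruning at several depths
# instead of once. Layer indices and ratios below are made-up placeholders.
schedule = {4: 0.75, 12: 0.75, 20: 0.75}
```

Because the score is derived purely from hidden states, this selection never needs the attention matrix, which is what keeps it position-unbiased and compatible with efficient attention operators.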