Token Compression
CVPR 2026 | Starting from the intrinsic variation of visual tokens: 1.87x lossless VLM acceleration
机器之心· 2026-03-15 06:00
Background and Motivation
- The demand for high-resolution image understanding and long-video processing has surged, sharply increasing the number of visual tokens that large vision-language models (LVLMs) must process and making inference efficiency a core deployment bottleneck [5]
- Existing token compression methods rely on attention weights to score token importance, which has two critical flaws: position bias and incompatibility with efficient operators [5][7]

Core Findings
- Finding 1: Attention-based methods exhibit systematic end bias, with retention rates for end tokens reaching 80%-100% while front tokens retain only 10%-30%, showing no correlation with content importance. In contrast, the L2 norm shows a near-uniform distribution, avoiding position bias [8][10]
- Finding 2: Tokens with high inter-layer variation correspond to semantically important regions. Several metrics (L1 norm, L2 norm, and cosine similarity) show clear peaks in relevant areas, indicating that variation is a robust intrinsic signal of visual token importance, with the L2 norm being the best metric [12][14]

Solution: V²Drop
- V²Drop applies a multi-stage progressive pruning strategy during LLM inference to achieve efficient, unbiased token compression. The process includes:
  1. Variation computation: the L2 distance between each visual token's representation and its previous-layer representation serves as an importance score, with negligible additional overhead [15]
  2. Token ranking and selection: tokens are ranked by variation score and the top K are retained, filtering out inactive tokens without introducing position bias [16]
  3. Progressive dropping: pruning is executed in shallow, middle, and deep layers, outperforming one-time pruning [18]

Experimental Results
- Image understanding: on LLaVA-1.5-7B, V²Drop compresses 66.7% of tokens (retaining 192) while keeping 97.6% of the original performance, surpassing the next-best method, PDrop, at 96.0% [23]
- Video understanding: V²Drop retains only 25% of tokens while keeping 98.6% of performance, outperforming DyCoke at 97.7% [25]
- Efficiency analysis: V²Drop reduces LLM generation latency by 31.5% on image tasks and 74.2% on video tasks, while increasing throughput and lowering peak memory usage [27][28]

Conclusion
- V²Drop opens a new path for accelerating LVLM inference by exploiting the link between token variation and task relevance. The framework is lightweight, progressive, and fully compatible with efficient operators, achieving top results on both image and video understanding tasks [31]
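The three-step procedure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the NumPy arrays standing in for hidden states are assumptions, and a progressive schedule would simply call `drop_tokens` at several layer depths with decreasing keep ratios.

```python
import numpy as np

def variation_scores(h_prev, h_curr):
    # Step 1: L2 distance between each token's current and
    # previous-layer hidden state serves as its importance score.
    return np.linalg.norm(h_curr - h_prev, axis=-1)

def drop_tokens(h_prev, h_curr, keep_ratio):
    # Step 2: rank tokens by variation and keep the top K,
    # preserving their original order (no position bias).
    scores = variation_scores(h_prev, h_curr)
    k = max(1, int(round(keep_ratio * h_curr.shape[0])))
    keep = np.sort(np.argsort(scores)[-k:])
    return h_curr[keep], keep

# Toy example: 8 visual tokens with hidden size 4.
rng = np.random.default_rng(0)
h_prev = rng.standard_normal((8, 4))
h_curr = h_prev.copy()
h_curr[[2, 5]] += 3.0  # two tokens change strongly between layers
kept, idx = drop_tokens(h_prev, h_curr, keep_ratio=0.25)
# idx -> [2, 5]: only the high-variation tokens survive
```

Because the score is just a norm over existing activations, it adds no attention-weight bookkeeping and stays compatible with efficient attention operators such as FlashAttention.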
Some thoughts on trends in chip design for on-device large models...
自动驾驶之心· 2025-10-23 00:04
Core Insights
- The article discusses the evolution of algorithms in the chip design industry, focusing on advances in attention mechanisms and their implications for future chip designs [2][4]

Group 1: Attention Mechanism Evolution
- The Transformer architecture dominates the large-model field, but its self-attention mechanism poses significant computational challenges, especially in power requirements during the prefill and decode phases [4]
- Various improvements to the Transformer structure have been proposed, such as Performer, Reformer, and Informer, but none achieved widespread adoption due to a lack of strong demand [4]
- Linear attention mechanisms aim to reduce computational complexity to linear levels, with models such as RWKV and Mamba following this approach [5]

Group 2: Dynamic Sparsity and MoE Technology
- Dynamic sparsity, particularly through Mixture of Experts (MoE) technology, has gained traction: only a subset of experts is activated during inference, which can yield better performance at lower computational cost [8]
- The trend toward ever-sparser MoE models, such as Ant Group's recent releases, marks a significant industry shift and demands larger memory and bandwidth [9]

Group 3: Low-Bit Quantization
- Low-bit quantization techniques such as FP8 training have opened new avenues for model efficiency, with weight-only quantization used to relieve bandwidth bottlenecks [11]
- Fine-grained quantization and mixed quantization strategies hold potential for optimizing model performance, especially in MoE models [12]

Group 4: Token Compression
- Token compression has emerged as a key lever for reducing the computational burden of large models, particularly for visual tokens, which show high redundancy [14]
- The article notes a surge in research on token compression, which could significantly affect chip design by lowering the barrier to deploying large models [14]

Group 5: Future Implications for Chip Design
- Advances in attention mechanisms, dynamic sparsity, low-bit quantization, and token compression are expected to substantially shape the design of future edge chips, which have lagged behind large-model development [14]
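The dynamic sparsity that Group 2 describes, where a router activates only a few experts per token, can be sketched as follows. This is a generic top-k routing illustration under assumed shapes, not any specific model's design; the gating matrix and linear experts are placeholders.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    # Router scores every expert, but only the top-k are executed:
    # the remaining experts cost nothing at inference time.
    logits = x @ gate_w                      # one score per expert
    top = np.argsort(logits)[-top_k:]        # indices of selected experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over selected experts
    return sum(w * experts[i](x) for i, w in zip(top, weights))

rng = np.random.default_rng(1)
d, num_experts = 4, 8
gate_w = rng.standard_normal((d, num_experts))
# Toy experts: simple linear maps standing in for FFN blocks.
experts = [(lambda W: (lambda x: x @ W))(rng.standard_normal((d, d)))
           for _ in range(num_experts)]
x = rng.standard_normal(d)
y = moe_forward(x, gate_w, experts, top_k=2)  # 2 of 8 experts run
```

The hardware consequence the article draws follows directly from this structure: all expert weights must sit in memory even though only a fraction is computed per token, so sparser MoE models shift the pressure from FLOPs to memory capacity and bandwidth.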