Visual Tokens
CVPR 2026 | 1.87× Lossless VLM Acceleration from the Intrinsic Variation of Visual Tokens
机器之心· 2026-03-15 06:00
Background and Motivation
- The demand for high-resolution image understanding and long-video processing has surged, sharply increasing the number of visual tokens that large vision-language models (LVLMs) must process and making inference efficiency a core deployment bottleneck [5]
- Existing token compression methods rely on attention weights to judge token importance, which has two critical flaws: position bias and incompatibility with efficient attention operators [5][7]

Core Findings
- Finding 1: Attention-based methods exhibit a systematic end bias: retention rates for tokens at the end of the sequence reach 80%-100% while front tokens retain only 10%-30%, uncorrelated with content importance. In contrast, L2 Norm yields a near-uniform distribution, avoiding position bias [8][10]
- Finding 2: Tokens with high inter-layer variation correspond to semantically important regions. Several metrics (L1 Norm, L2 Norm, and cosine similarity) all show pronounced peaks in relevant regions, indicating that variation is a robust intrinsic signal of visual token importance, with L2 Norm the best-performing metric [12][14]

Solution: V²Drop
- V²Drop applies a multi-stage progressive pruning strategy during LLM inference to achieve efficient, unbiased token compression. The process comprises:
  1. Variation computation: the L2 distance between each visual token's hidden state and its state at the previous layer serves as the importance score, at negligible extra cost [15]
  2. Token ranking and selection: tokens are ranked by variation score and the top K are retained, filtering out inactive tokens without introducing position bias [16]
  3. Progressive dropping: pruning is executed at shallow, mid, and deep layers, outperforming one-shot pruning [18]

Experimental Results
- Image understanding: On LLaVA-1.5-7B, V²Drop compresses away 66.7% of visual tokens (retaining 192) while preserving 97.6% of baseline performance, surpassing the next-best method, PDrop, at 96.0% [23]
- Video understanding: V²Drop retains only 25% of tokens while preserving 98.6% of baseline performance, outperforming DyCoke at 97.7% [25]
- Efficiency analysis: V²Drop reduces LLM generation latency by 31.5% on image tasks and 74.2% on video tasks, while increasing throughput and lowering peak memory usage [27][28]

Conclusion
- V²Drop opens a new path for accelerating inference in vision-language models by exploiting the link between token variation and task relevance. The framework is lightweight, progressive, and fully compatible with efficient attention operators, achieving the best results among compared methods on both image and video understanding tasks [31]
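The three-step procedure above can be sketched in a few lines of NumPy. This is a minimal illustration of variation-based top-K pruning, not the paper's implementation; the hidden states, keep ratio, and the choice of a single pruning point are toy assumptions.

```python
import numpy as np

def variation_scores(prev_hidden, curr_hidden):
    # Step 1: importance of each visual token = L2 distance between its
    # hidden state at the current layer and at the previous layer.
    return np.linalg.norm(curr_hidden - prev_hidden, axis=-1)

def prune_by_variation(curr_hidden, scores, keep_ratio):
    # Step 2: rank by variation and keep the top K, then restore sequence
    # order so that no positional bias is introduced.
    k = max(1, int(round(len(scores) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-k:])
    return curr_hidden[keep], keep

# Toy demo: 6 tokens with distinct per-token variation. In V2Drop this
# pruning is applied progressively at shallow, mid, and deep layers (step 3).
prev = np.zeros((6, 4))
curr = np.outer([0.1, 0.2, 3.0, 0.3, 0.05, 1.0], np.ones(4))
kept, idx = prune_by_variation(curr, variation_scores(prev, curr), keep_ratio=0.5)
print(idx.tolist())  # [2, 3, 5] -- the most "active" tokens survive
```

Because only a norm of a difference of already-computed hidden states is needed, the score adds negligible overhead and requires no attention weights, which is what keeps the method compatible with efficient attention operators.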
Zhipu Was Just a Bit Unlucky: Its Visual Token Research Collides with DeepSeek Again
量子位· 2025-10-22 15:27
Core Viewpoint
- The article covers the competition between Zhipu and DeepSeek in the AI field, focusing on Zhipu's newly released visual token solution, Glyph, which tackles the long-context challenge in large language models (LLMs) [1][2][6]

Group 1: Context Expansion Challenges
- Demand for long context in LLMs keeps growing, driven by applications such as document analysis and multi-turn dialogue [8]
- Extending context length sharply raises computational cost; for instance, growing the context from 50K to 100K tokens can roughly quadruple compute consumption [9][10]
- Simply adding more tokens does not guarantee better model performance: excessive input invites noise interference and information overload [12][14]

Group 2: Existing Solutions
- Three mainstream approaches to the long-context problem:
  1. Extended position encoding: stretches the existing position-encoding range to accommodate longer inputs without retraining the model [15][16]
  2. Attention mechanism modification: techniques such as sparse and linear attention improve token-processing efficiency but do not reduce the total token count [20][21]
  3. Retrieval-augmented generation (RAG): shortens the input via external retrieval, but retrieval can slow overall response time [22][23]

Group 3: Glyph Framework
- Glyph proposes a new paradigm: rendering long text as images, which carry higher information density and can be processed efficiently by vision-language models (VLMs) [25][26]
- Visual tokens cut the token count substantially; for example, the full text of "Jane Eyre" needs only about 80K visual tokens versus roughly 240K text tokens [32][36]
- Glyph's training pipeline has three stages: continual pre-training, LLM-driven rendering search, and post-training, which together strengthen the model's ability to interpret rendered text [37][44]

Group 4: Performance and Results
- Glyph achieves a 3-4x token compression rate while keeping accuracy comparable to mainstream models [49]
- It delivers roughly 4x faster prefill and decoding, as well as about 2x faster supervised fine-tuning (SFT) [51]
- Glyph also performs strongly on multimodal tasks, indicating robust generalization [53]

Group 5: Contributors and Future Implications
- The paper's first author is Jiale Cheng, a PhD student at Tsinghua University, with contributions from Yusen Liu, Xinyu Zhang, and Yulin Fei [57][62]
- The article suggests that visual tokens may redefine how LLMs process information, potentially making pixels, rather than text, the fundamental unit of AI input [76][78]
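Glyph's token savings follow from a simple observation: a rendered page costs a fixed number of visual tokens determined by its resolution, no matter how much text is drawn on it. The back-of-the-envelope sketch below illustrates this; the page size, patch size, patch-merge factor, and characters-per-token figure are illustrative assumptions, not Glyph's actual rendering-search settings.

```python
def glyph_compression_ratio(chars_on_page, chars_per_text_token=4.0,
                            page_px=1120, patch_px=28, merge=2):
    # A rendered page costs a FIXED number of visual tokens set by its
    # resolution, independent of how much text is drawn on it.
    grid = page_px // patch_px                     # 40 x 40 patch grid
    visual_tokens = (grid * grid) // (merge ** 2)  # 2x2 patch merging -> 400 tokens/page
    text_tokens = chars_on_page / chars_per_text_token
    return text_tokens / visual_tokens

# Densely rendering ~6,000 characters per page yields a ratio in the 3-4x
# range the article reports:
print(glyph_compression_ratio(6000))  # 3.75
```

Under these assumptions, the ratio scales linearly with how densely text is packed onto each page, which is why Glyph's LLM-driven rendering search over fonts, layouts, and resolutions matters for the final compression rate.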