VScan
Losslessly Accelerate Vision-Language Model Inference! Easily Prune Redundant Visual Tokens | Tencent AI Lab
量子位· 2025-07-04 01:42
Core Insights
- The article discusses the challenges facing large vision-language models (LVLMs) as visual token counts grow sharply, driving up inference cost and creating performance bottlenecks [1][2][3]
- Tencent AI Lab and CMU have proposed a solution called VScan, which improves inference efficiency without modifying the model architecture or retraining, achieving up to 2.91x acceleration [2][5][38]

Group 1
- Growth in visual token counts, with LLaVA-NeXT processing up to 2,880 tokens and Qwen2.5-VL handling up to 16,384 tokens, leads to quadratic growth in attention computation during inference (a back-of-the-envelope illustration follows at the end of this summary) [2][4]
- VScan has been empirically validated across multiple mainstream LVLMs, including LLaVA-1.5, LLaVA-NeXT, Qwen2.5-VL, and Video-LLaVA, covering tasks such as image question answering and video understanding [4][5]
- VScan's two-stage token filtering mechanism substantially reduces the visual token input while maintaining accuracy, making it suitable for a range of resource-constrained environments [5][28]

Group 2
- Existing visual token pruning methods can be categorized into text-agnostic and text-aware approaches, but they often lack a comprehensive view of the cross-stage information flow in LVLMs [8][9]
- VScan's design is based on a systematic analysis of visual token contributions throughout the entire inference process, from visual encoding to language decoding [10][12][19]
- The article emphasizes that effective pruning strategies should account for the dynamic value of tokens across the entire encoding process rather than relying solely on final-layer attention [15][22]

Group 3
- VScan employs a dual scanning mechanism: a global scan retains semantically critical tokens, while a local scan captures detailed information from otherwise overlooked regions (see the code sketch at the end of this summary) [30][26]
- The first pruning phase occurs in the visual encoding stage, while the second phase removes text-irrelevant visual tokens during the language decoding stage, optimizing when pruning happens [27][24]
- Experimental results demonstrate that VScan significantly reduces visual token counts and inference time while maintaining high accuracy, outperforming existing methods [29][28]

Group 4
- VScan has been tested on various LVLMs, including LLaVA-1.5 and Qwen2.5-VL, across multiple benchmark datasets, showing robust performance even under high compression rates [28][34]
- In practical settings, VScan achieved a 1.37x inference speedup with LLaVA-1.5-7B and a 2.05x speedup with LLaVA-NeXT-7B, with minimal performance degradation [36][38]
- The solution is open-sourced on GitHub, allowing the community to validate and build on this efficient pruning paradigm [6][39]
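As a rough illustration of the quadratic-cost point in Group 1 (not a figure from the article), self-attention cost over the visual tokens scales roughly with the square of their count. The 576-token figure for LLaVA-1.5 is background knowledge rather than a number quoted above; 2,880 and 16,384 are the counts cited in the summary.

```python
# Back-of-the-envelope only: self-attention FLOPs grow roughly with the square of
# the sequence length, so the visual-token portion of the attention cost scales as
# shown below relative to LLaVA-1.5's usual 576 visual tokens.
for n in (576, 2880, 16384):
    print(n, round((n / 576) ** 2, 1))  # relative attention cost vs. 576 tokens
# 576 1.0 / 2880 25.0 / 16384 809.1
```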
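The sketch below is a minimal, hypothetical rendering of the two-stage pruning idea summarized in Group 3. It is not the official VScan implementation (see the GitHub release for that); the function names, window size, keep ratios, and the use of [CLS] attention and text-to-visual attention as importance signals are illustrative assumptions.

```python
# Hypothetical two-stage visual token pruning sketch; not the official VScan code.
import torch


def global_local_scan(vis_tokens, cls_attn, keep_global=0.25, window=4, keep_per_window=1):
    """Stage 1 (visual encoding): keep globally important tokens plus the best
    token in each local window, so detail in overlooked regions survives."""
    n = vis_tokens.shape[0]
    # Global scan: rank every token by the attention it receives from [CLS].
    global_idx = cls_attn.topk(max(1, int(n * keep_global))).indices
    # Local scan: within each contiguous window of tokens, keep the strongest one(s).
    local_idx = []
    for start in range(0, n, window):
        w = cls_attn[start:start + window]
        local_idx.append(w.topk(min(keep_per_window, w.numel())).indices + start)
    keep = torch.unique(torch.cat([global_idx, *local_idx]))
    return vis_tokens[keep], keep


def decode_stage_prune(vis_tokens, text_to_vis_attn, keep_ratio=0.5):
    """Stage 2 (language decoding, at an intermediate layer): drop visual tokens
    that receive little attention from the text tokens, i.e. text-irrelevant ones."""
    k = max(1, int(vis_tokens.shape[0] * keep_ratio))
    keep = text_to_vis_attn.topk(k).indices
    return vis_tokens[keep], keep


# Toy usage with random tensors standing in for real model activations.
vis = torch.randn(576, 1024)            # 576 visual tokens, hidden size 1024
cls_attn = torch.rand(576)              # attention from [CLS] to each visual token
vis, kept = global_local_scan(vis, cls_attn)
text_attn = torch.rand(vis.shape[0])    # aggregated text-to-visual attention
vis, kept = decode_stage_prune(vis, text_attn)
print(vis.shape)                        # far fewer than the original 576 tokens
```

The split mirrors the summary's claim about timing: cheap, text-agnostic filtering happens inside the visual encoder, while text-aware filtering is deferred until the language model has query context to judge relevance.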