Eastern Institute of Technology Team Proposes HiDrop: Restructuring the MLLM Computation Path, Compressing 90% of Visual Tokens for a 2.2× Speedup
机器之心· 2026-03-23 11:56
Core Insights
- The article discusses the efficiency bottleneck in multi-modal large language models (MLLMs) caused by the much larger number of visual tokens relative to text tokens, and how to address it [2][3]
- It introduces HiDrop, a novel framework that compresses visual tokens while maintaining model performance and improving computational efficiency [25]

Group 1: MLLM Functionality and Challenges
- Existing research typically applies fixed strategies for visual token pruning, neglecting the functional differences across the layers of an MLLM [3]
- Analysis reveals that different layers in MLLMs serve distinct roles: shallow layers primarily transmit visual features, middle layers perform cross-modal fusion, and deep layers focus on semantic integration and reasoning [3][9]

Group 2: HiDrop Framework
- HiDrop employs a three-stage hierarchical alignment compression strategy, aligning visual token processing with the model's layer structure to significantly reduce computational cost while preserving performance [15][16]
- The three stages are:
  1. Shallow layers: delayed injection of visual tokens, minimizing computational load without affecting performance [19]
  2. Middle layers: concave pyramid pruning that aggressively reduces visual tokens while retaining the key tokens that most influence the text representation [20]
  3. Deep layers: early exit of visual tokens, so that subsequent layers operate only on the fused semantic representation [21]

Group 3: Experimental Results
- HiDrop achieves approximately 90% compression of visual tokens while maintaining 98.3% of the original model's performance, demonstrating a superior compression-performance trade-off [4][22]
- The method also yields a 1.72× training speedup and a 2.2× pre-filling acceleration, indicating significant gains in computational efficiency [24][25]
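The three-stage pipeline above can be sketched as a per-layer visual-token budget. This is a minimal illustration only: the layer boundaries (`inject_layer`, `exit_layer`), the quadratic decay schedule, and the attention-based scoring are assumptions chosen to show the shape of the idea, not HiDrop's published settings.

```python
def visual_token_budget(num_layers, num_visual,
                        inject_layer=2, exit_layer=24, power=2.0):
    """Per-layer count of visual tokens kept (illustrative schedule).

    - Layers [0, inject_layer): budget 0  -> delayed injection (stage 1)
    - Layers [inject_layer, exit_layer): budget decays from num_visual
      toward 1 along a pyramid-shaped schedule  -> pruning (stage 2)
    - Layers [exit_layer, num_layers): budget 0 -> early exit (stage 3)
    """
    budget = []
    span = exit_layer - inject_layer
    for layer in range(num_layers):
        if layer < inject_layer or layer >= exit_layer:
            budget.append(0)
        else:
            t = (layer - inject_layer) / span       # 0 -> 1 across the middle block
            keep = int(num_visual * (1 - t) ** power)  # drops fastest in early middle layers
            budget.append(max(keep, 1))
    return budget


def prune_visual_tokens(visual, attn_to_text, k):
    """Keep the k visual tokens with the highest text-to-visual attention
    mass, preserving their original order. `visual` and `attn_to_text` are
    parallel lists of length num_visual (a stand-in for real tensors)."""
    top = sorted(range(len(visual)), key=lambda i: attn_to_text[i], reverse=True)[:k]
    return [visual[i] for i in sorted(top)]
```

For example, with a 32-layer model and 576 visual tokens, `visual_token_budget(32, 576)` gives 0 tokens in layers 0–1, the full 576 at layer 2, a monotonically shrinking budget through the middle block, and 0 from layer 24 on; `prune_visual_tokens` is then applied at each middle layer to meet that budget.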