Core Insights
- The article introduces CalibQuant, a 1-bit KV cache quantization method for multimodal large language models (LLMs) that significantly increases throughput while maintaining model performance [1][5][18].

Group 1: Motivation and Challenges
- Current multimodal LLMs struggle with large, high-resolution image or video inputs: the KV cache grows in proportion to input length, inflating memory usage and limiting throughput [6].
- Existing quantization methods for LLM KV caches do not exploit the visual redundancy specific to multimodal contexts, making them ineffective at extreme (e.g., 1-bit) compression [6][7].

Group 2: Methodology
- CalibQuant employs a novel 1-bit quantization strategy that combines post-scaling and calibration techniques to reduce memory and computational costs without altering the original model [3][5].
- Channel-wise quantization narrows the statistical range used for quantization on a per-channel basis, preserving model performance better than global statistics [9][10].
- A post-scaling strategy reorders the computation during dequantization, improving efficiency and reducing storage needs [11][12].
- A calibration method adjusts attention scores before the softmax, mitigating the extreme values introduced by 1-bit quantization [13][14].

Group 3: Experimental Results
- Tested on LLaVA and InternVL models across various tasks, the method outperforms existing approaches such as KIVI and VLCache, particularly on the captioning task [15][18].
- For instance, at 1-bit quantization the method achieves a CIDEr score of 1.109 on the llava-1.5-7b model, surpassing VLCache's 1.053 [15].
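The three components in Group 2 can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function names, the midpoint rounding rule, and the linear calibration with a `tau` parameter are all assumptions made for the sketch.

```python
import numpy as np

def quantize_1bit_channelwise(K):
    """1-bit quantization with per-channel (not global) min/max ranges.
    K: (seq_len, head_dim) cached keys. Returns {0, 1} codes plus the
    per-channel min/max needed for dequantization."""
    kmin = K.min(axis=0, keepdims=True)              # (1, d)
    kmax = K.max(axis=0, keepdims=True)              # (1, d)
    bits = (K > (kmin + kmax) / 2).astype(K.dtype)   # 1 bit per entry
    return bits, kmin, kmax

def attention_scores_post_scaled(q, bits, kmin, kmax):
    """Post-scaling: avoid materializing the dequantized keys
    K_hat = kmin + bits * (kmax - kmin). Fold the scales into the
    query instead, so the matmul runs directly on the 1-bit codes:
        K_hat @ q = bits @ (q * (kmax - kmin)) + kmin . q
    """
    scale = (kmax - kmin).ravel()                    # (d,)
    return bits @ (q * scale) + float(kmin.ravel() @ q)

def calibrate_scores(scores, tau=0.9):
    """Hypothetical calibration step: linearly shrink pre-softmax scores
    toward their mean to damp the extreme values that 1-bit rounding
    produces. The paper's exact calibration formula may differ."""
    return scores.mean() + tau * (scores - scores.mean())

# Usage: post-scaled scores match explicit dequantization exactly.
K = np.random.randn(8, 4).astype(np.float32)   # toy key cache
q = np.random.randn(4).astype(np.float32)      # one query vector
bits, kmin, kmax = quantize_1bit_channelwise(K)
dequant_scores = (kmin + bits * (kmax - kmin)) @ q
post_scores = attention_scores_post_scaled(q, bits, kmin, kmax)
assert np.allclose(dequant_scores, post_scores, atol=1e-5)
probs = np.exp(calibrate_scores(post_scores))
probs /= probs.sum()                           # softmax over calibrated scores
```

The point of the post-scaling rewrite is that the expensive `(seq, d) @ (d,)` product touches only the packed 1-bit codes; the floating-point scale and offset are applied once per channel, after the fact.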
Group 4: Runtime Analysis
- The 1-bit quantization method consistently outperforms the 16-bit baseline in throughput across different memory budgets, reaching up to 459.016 tokens per second versus the baseline's 40.816 tokens per second [17].
- This corresponds to a throughput improvement of roughly 9.88× to 11.24×, demonstrating the method's effectiveness under constrained memory [17].

Group 5: Conclusion
- The article concludes that CalibQuant effectively addresses KV cache compression in multimodal LLMs, enhancing both computational efficiency and model performance [18].
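The peak speedup in Group 4 follows directly from the two throughput figures:

```python
# Throughput figures quoted in the article's runtime analysis [17].
baseline_tps = 40.816     # 16-bit baseline, tokens per second
calibquant_tps = 459.016  # 1-bit CalibQuant, tokens per second

speedup = calibquant_tps / baseline_tps
print(f"peak speedup: {speedup:.2f}x")  # ~11.25x; the article truncates to 11.24x
```

The 9.88× lower end of the quoted range comes from a different memory budget, whose per-budget throughput numbers are not reproduced in this summary.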
10× throughput gain with no performance loss: a plug-and-play KV cache quantization strategy for multimodal models, no changes to the original model required
量子位 (QbitAI) · 2025-04-03 02:12