KV Cache Compression
2x Speedup! KV Cache Compression Isn't Just About Importance: SJTU Team Makes Model Inference "Fast and Stable" | ICLR'26
量子位· 2026-03-31 01:53
Core Insights
- The article discusses the challenges of KV cache compression in long-context reasoning for Vision-Language Models (VLMs) and Large Language Models (LLMs) and how to address them [1][2][42]
- It introduces MixKV, a method that combines importance and diversity in KV cache selection to improve the stability and coverage of the compressed context [5][13][42]

Group 1: KV Cache Challenges
- As context length grows, the KV cache expands linearly, increasing memory usage and bandwidth costs and degrading throughput [3][5]
- Traditional compression methods often focus solely on "importance," neglecting the inherent "semantic redundancy" in multimodal KV caches, which can lead to instability [5][12]

Group 2: Key Findings
- The research team visualized the statistical properties of the KV cache and found that multimodal inputs exhibit higher semantic redundancy, indicating a larger compressible space [8][10]
- Redundancy levels differ significantly across heads within the same model, indicating a non-uniform distribution of redundancy [10][12]

Group 3: MixKV Solution
- MixKV aims to retain KV entries that are both important and diverse, reducing the risk of losing semantic coverage to redundancy [13][23]
- The method consists of two scoring steps (importance and diversity) plus a head-wise mixing step that adaptively balances the two factors according to each head's redundancy level [14][15][16]

Group 4: Experimental Results
- MixKV delivered consistent performance improvements across benchmarks in multimodal understanding, long-context reasoning, and GUI localization tasks [25][29][37]
- The method showed significant efficiency gains, reducing inference latency and peak memory usage under extreme compression [41][42]

Group 5: Conclusion
- MixKV represents a critical upgrade for KV cache compression in long-context reasoning, underscoring the need to account for redundancy structure in the design paradigm for scalable deployment of VLMs and LLMs [42]
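The "importance plus diversity plus head-wise mixing" recipe described above can be sketched in plain Python. The concrete choices below are illustrative assumptions, not the paper's exact formulation: mean pairwise cosine similarity as the per-head redundancy measure, greedy max-min distance to already-selected keys as the diversity score, and a mixing weight derived directly from redundancy.

```python
import math

def _cos(a, b):
    """Cosine similarity of two vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def mixkv_select(keys, attn, budget):
    """Toy MixKV-style selection for ONE attention head.

    keys   : list of cached key vectors, one per token
    attn   : accumulated attention each position received (importance proxy)
    budget : number of KV entries to keep

    Returns sorted indices of kept entries. The scoring rules here are
    assumptions for illustration only.
    """
    T = len(keys)
    sim = [[_cos(keys[i], keys[j]) for j in range(T)] for i in range(T)]

    # Head-wise redundancy: mean pairwise cosine similarity (self excluded).
    redundancy = sum(sim[i][j] for i in range(T)
                     for j in range(T) if i != j) / (T * (T - 1))
    # Assumed mixing rule: a more redundant head weights diversity more.
    alpha = min(max(1.0 - redundancy, 0.0), 1.0)  # weight on importance

    mx = max(attn)
    imp = [a / (mx + 1e-8) for a in attn]

    selected = [imp.index(max(imp))]  # seed with the most important token
    while len(selected) < min(budget, T):
        best, best_score = None, float("-inf")
        for i in range(T):
            if i in selected:
                continue
            # Diversity: distance from the nearest already-selected key.
            div = 1.0 - max(sim[i][j] for j in selected)
            score = alpha * imp[i] + (1.0 - alpha) * div
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)
```

Under this sketch, a head whose keys are nearly collinear (high redundancy) will pick entries mostly for coverage, while a low-redundancy head falls back to pure importance ranking, which is the adaptive behavior the summary attributes to head-wise mixing.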
Lossless Math Reasoning with Just 10% of the KV Cache! This Open-Source Method Tackles Reasoning LLMs' "Memory Overload" Problem
量子位· 2025-06-16 04:50
Core Viewpoint
- R-KV is a highly efficient compression method that turns the "rambling" of large reasoning models into controllable memory entries, reducing memory usage by 90%, increasing throughput by 6.6x, and maintaining 100% accuracy [1][2]

Group 1: R-KV Methodology
- R-KV applies a three-step process during model decoding: redundancy identification, importance assessment, and dynamic eviction of key/value (KV) tokens [5]
- The method compresses the KV cache in real time, retaining only important and non-redundant tokens and thereby addressing redundancy during inference [7][9]

Group 2: Performance Metrics
- On challenging mathematical benchmarks, R-KV significantly outperformed baseline methods and even the full-KV setup, reaching 34% accuracy with R1-Llama-8B and 54% with R1-Qwen-14B on MATH-500 [19]
- R-KV achieved up to 90% memory savings while sustaining high throughput (up to 2525.75 tokens per second), with notable gains in batch size and overall task performance [20][21]

Group 3: Visual Comparison
- A visual comparison between R-KV and SnapKV shows that R-KV retains critical context and effectively reduces noise, leading to better task completion [12][15]
- R-KV's token selection spans the entire reasoning process, preserving essential keys and values, whereas SnapKV tends to focus on local segments and may retain redundant information [14]

Group 4: Application Scenarios
- R-KV suits edge devices that require long-chain reasoning, enabling even consumer-grade GPUs and mobile NPUs to run complex models [22]
- The method can also accelerate reinforcement-learning sampling and is training-free and plug-and-play [22]
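The three steps attributed to R-KV (redundancy identification, importance assessment, dynamic eviction) can be illustrated with a minimal pure-Python pass over one head's cache. The concrete scores are assumptions for illustration: cosine similarity against a threshold for redundancy, and accumulated attention as the importance signal; the actual method's formulation may differ.

```python
import math

def _cos(a, b):
    """Cosine similarity of two vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def rkv_evict(keys, attn, budget, sim_thresh=0.9):
    """Toy R-KV-style keep/evict pass for one attention head.

    keys       : list of cached key vectors, one per token
    attn       : accumulated attention per position (importance proxy)
    budget     : maximum number of KV entries to keep
    sim_thresh : assumed redundancy threshold on cosine similarity

    Returns sorted indices of entries to KEEP.
    """
    keep = []
    # Step 2 first: visit tokens from most to least important.
    order = sorted(range(len(keys)), key=lambda i: -attn[i])
    for i in order:
        # Step 1: drop a token that is a near-duplicate (high cosine
        # similarity) of an already-kept, more important token.
        if any(_cos(keys[i], keys[j]) > sim_thresh for j in keep):
            continue
        keep.append(i)
        # Step 3: dynamic eviction stops once the budget is filled.
        if len(keep) == budget:
            break
    return sorted(keep)
```

Run at each decoding step (or every few steps), a pass like this keeps the cache at a fixed budget while preferring tokens that are both important and non-redundant, which is the behavior the summary describes.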