Countering the Fragility of KV Cache Compression: Two Lines of Code Defend Against Underlying-Assumption Collapse via Worst-Case Risk Control
机器之心· 2026-03-25 04:01
Core Insights
- The article discusses the research paper "DefensiveKV: Taming the Fragility of KV Cache Eviction in LLM Inference," which identifies a fundamental flaw in the underlying assumptions of current KV Cache compression methods [3][11]
- The research team proposes a new strategy called Defensive Aggregation, which shifts the focus from average-loss optimization to worst-case risk control, aiming to enhance the robustness of KV Cache compression [5][16]

Summary by Sections

Research Background
- The research team from the University of Science and Technology of China previously developed popular KV Cache compression methods such as AdaKV and CriticalKV, which significantly improve compression efficiency with minimal code changes [2]
- Demand for KV Cache storage has surged with the rapid growth of large models' long-context capabilities, prompting an influx of KV Cache compression methods [2]

Key Findings
- The main assumption of existing KV Cache methods is that the importance of cached entries remains stable over time; the research team found this assumption to be fundamentally flawed [3][4]
- While average importance metrics generally reflect true cache importance, they can fail dramatically during specific time intervals, leading to significant performance degradation [4][5]

Proposed Solution
- The Defensive Aggregation strategy is introduced to address the identified flaw, optimizing for worst-case risk rather than average loss [5][11]
- The core algorithm requires only two lines of code, emphasizing simplicity while achieving substantial performance improvements [6][7]

Implementation Details
- The first step estimates worst-case risk by retaining any cache entry that has shown importance at any historical moment, ensuring that potentially critical tokens are preserved [7]
- The second step incorporates an adaptive prior-risk correction mechanism to account for limited observations, enhancing the robustness of the cache retention strategy [8]

Performance Results
- The new method, DefensiveKV, and its enhanced version, Layer-DefensiveKV, demonstrate significant performance improvements across various tasks and datasets, reducing generation quality loss from 9.6% to 4.1%, and further to 2.1% under stringent conditions [11][13]
- The research highlights the importance of redefining optimization goals in KV Cache compression, advocating a defensive strategy to counteract the inherent weaknesses of existing assumptions [16]

Additional Insights
- KV Cache compression performance has improved continuously over the past year, evolving from AdaKV to CriticalKV and now to DefensiveKV, with performance scores rising from 39.0 to 91.4 [16]
- Defensive Aggregation is presented as a complementary method that can be integrated with existing KV Cache compression techniques for further performance gains [16]
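The two steps summarized above (worst-case aggregation over observed importance, plus a prior-risk correction for scarce observations) can be sketched in NumPy. This is a minimal illustration of the idea, not the paper's actual two lines of code: the function names, the uniform prior, and the `prior_weight` knob are all assumptions introduced here.

```python
import numpy as np

def defensive_score(attn_scores, prior_weight=0.1):
    """Score each cached token by its worst-case observed importance.

    attn_scores: array of shape (num_steps, num_tokens), the attention
    weight each cached token received at each observed decoding step.
    """
    # Step 1: worst-case risk -- a token that mattered at ANY moment is
    # worth keeping, so aggregate by max over time instead of the mean.
    worst_case = attn_scores.max(axis=0)
    # Step 2: prior-risk correction (hypothetical form) -- with few
    # observed steps the max is unreliable, so blend in a uniform prior
    # that shrinks as more steps are observed.
    num_steps = attn_scores.shape[0]
    return worst_case + prior_weight / num_steps

def evict(attn_scores, budget):
    """Keep the `budget` tokens with the highest defensive scores."""
    scores = defensive_score(attn_scores)
    return np.sort(np.argsort(scores)[-budget:])
```

A mean-based score would evict a token whose importance spiked only briefly; the max-based score retains it, which is exactly the failure mode the summary describes.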
Hu Xia of Shanghai AI Lab: With KV Cache Compression, a $20,000 GPU Can Deliver $200,000 of Value | GAIR 2025
雷峰网· 2025-12-12 07:16
Core Viewpoint
- The article discusses advancements in large language models (LLMs), focusing on increasing context length and improving inference efficiency through a method called "Lossy Computation" [3][4]

Group 1: Advancements in Large Language Models
- Different LLM vendors have made significant breakthroughs in handling ultra-long contexts, with models like MiniMax-M1 and Qwen2.5-1M supporting inputs of millions of tokens [2]
- The competition to extend context length in LLMs is ongoing, as longer contexts unlock applications in fields such as finance, law, and healthcare [3]
- The research team led by Hu Xia proposes "Lossy Computation" to improve inference efficiency by intentionally introducing controllable information loss without degrading performance [4][7]

Group 2: Technical Breakthroughs
- The proposed method achieved two key breakthroughs: extending the context length of LLMs to eight times the original level, and quantizing the KV Cache to 2 bits, yielding an 8-fold increase in memory efficiency and a 3.5-fold speedup [4][12]
- The approach trades some precision for large reductions in computational and storage costs, spanning model parameter quantization, KV Cache compression, model pruning, and knowledge distillation [3][4]

Group 3: Application and Impact
- The KV Cache is crucial in LLM training and inference; compressing it from 16 bits to 2 bits can substantially increase effective GPU storage capacity, enhancing the value of expensive GPUs [13]
- The "Lossy Computation" method has been tested primarily on the Llama model, with results expected to be published in 2024 [14]
- The method is already used in mainstream open-source software packages such as Hugging Face's Transformers and llama.cpp [15]

Group 4: Future Directions and Considerations
- While the method is primarily designed for language models, its effects on multimodal models or other intelligent agents may vary and require further analysis [19]
- Potential applications of "Lossy Computation" extend to chatbot systems and healthcare, particularly rare disease diagnosis, where it has shown promising results [28]
- Future research may explore the practical application of 2-bit compression in real-world scenarios and the hardware considerations needed to maximize the method's potential [29][30]
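The 16-bit-to-2-bit compression described above can be illustrated with a per-group asymmetric quantizer. This is a generic 2-bit quantization sketch under assumed design choices (group size, min/max scaling), not the team's published kernel; real implementations also differ in details such as quantizing keys per-channel and values per-token.

```python
import numpy as np

def quantize_2bit(kv, group_size=64):
    """Per-group asymmetric 2-bit quantization of a KV tensor.

    Assumes kv.size is divisible by group_size. Each group stores a
    float scale and zero-point plus one 2-bit code per element
    (kept in a uint8 here for clarity; real kernels pack 4 codes/byte).
    """
    flat = kv.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 3.0          # 2 bits -> 4 levels (codes 0..3)
    scale[scale == 0] = 1.0          # guard constant groups
    codes = np.clip(np.round((flat - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_2bit(codes, scale, lo, shape):
    """Reconstruct an approximate KV tensor from codes and group params."""
    return (codes * scale + lo).reshape(shape)
```

Storing 2 bits instead of 16 per value is where the article's roughly 8-fold memory-efficiency figure comes from (16 / 2 = 8, before accounting for the small per-group scale/zero-point overhead).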
Cutting the KV Cache Budget to 1.5%! They Used an Evolutionary Algorithm to Slash Large-Model Memory Usage
机器之心· 2025-09-14 05:16
Core Insights
- EvolKV achieves superior performance with only 1.5% of the full KV cache budget, significantly reducing inference costs for large language models [1][11][25]
- Traditional KV cache methods struggle with long input texts, leading to increased storage requirements and slower processing [3][4]

KV Cache Optimization
- Existing KV cache compression methods rely primarily on heuristic approaches, which may not optimally retain task-relevant information [4][9]
- EvolKV introduces an evolutionary framework that adaptively allocates KV cache budgets across transformer layers, optimizing directly for downstream task performance [6][10]

Performance Improvements
- In various benchmark tests, EvolKV consistently outperforms baseline methods, achieving up to a 13% improvement on the Needle-in-a-Haystack benchmark and maintaining high accuracy on the GSM8K dataset [11][30][25]
- The method demonstrates strong adaptability across diverse tasks, maintaining competitive performance even with reduced cache budgets [25][29]

Experimental Results
- Comprehensive experiments on Mistral-7B-Instruct and Llama-3-8B-Instruct show that EvolKV outperforms all baseline methods across multiple KV cache budget configurations [22][24]
- In the LongBench evaluation, EvolKV consistently achieved the highest average performance, even surpassing the full model in certain configurations [22][25]

Evolutionary Algorithm Mechanism
- The evolutionary algorithm generates candidate budget allocations and evaluates their fitness based on downstream task performance, guiding the optimization process [13][14]
- The optimization process is structured in layer groups to enhance efficiency, allowing for more stable optimization dynamics [16][17]

Cache Budget Allocation
- EvolKV employs a dynamic, task-driven approach to allocating KV cache budgets, aligning the distribution with the functional contributions of different transformer layers [10][19]
- The method includes a mechanism for adjusting the total KV cache budget to ensure fairness in evaluation [20]
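The search loop summarized above (propose per-layer budget allocations, score them by downstream task fitness, renormalize so every candidate spends the same total budget) can be sketched as follows. This is a hypothetical illustration of the general evolutionary approach, not EvolKV's published algorithm; `fitness_fn`, the mutation scheme, and all hyperparameters are assumptions, and in practice the fitness call would run the compressed model on a task.

```python
import random

def normalize(budgets, total):
    """Rescale per-layer budgets to the fixed total (fairness adjustment)."""
    s = sum(budgets)
    scaled = [max(1, round(b * total / s)) for b in budgets]
    # Absorb integer-rounding drift in the last layer.
    scaled[-1] = max(1, scaled[-1] + total - sum(scaled))
    return scaled

def evolve_budgets(num_layers, total, fitness_fn,
                   pop_size=16, generations=30, sigma=0.2, seed=0):
    """Elitist evolutionary search over per-layer KV cache budgets."""
    rng = random.Random(seed)
    pop = [normalize([rng.uniform(0.5, 1.5) for _ in range(num_layers)], total)
           for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half as parents (elitism keeps the best-so-far).
        parents = sorted(pop, key=fitness_fn, reverse=True)[:pop_size // 2]
        # Each parent spawns one child via multiplicative Gaussian mutation.
        children = [normalize([max(1e-3, b * (1 + rng.gauss(0, sigma)))
                               for b in p], total)
                    for p in parents]
        pop = parents + children
    return max(pop, key=fitness_fn)
```

With a toy fitness that rewards concentrating budget in one layer, the search shifts the allocation toward that layer while the normalization keeps the total spend (approximately, up to rounding) fixed.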