KV Cache Compression
2x Speedup! KV Cache Compression Isn't Just About Importance: SJTU Team Makes Model Inference "Fast and Stable" | ICLR'26
量子位· 2026-03-31 01:53
Core Insights
- The article discusses the challenges of KV cache compression in long-context reasoning for Vision-Language Models (VLMs) and Large Language Models (LLMs) and how to address them [1][2][42]
- It introduces MixKV, a method that combines importance and diversity in KV cache selection to improve the stability and coverage of the compressed context [5][13][42]

Group 1: KV Cache Challenges
- As context length grows, the KV cache expands linearly, increasing memory usage and bandwidth costs and degrading throughput [3][5]
- Traditional compression methods often focus solely on "importance," neglecting the inherent "semantic redundancy" in multimodal KV caches, which can lead to instability [5][12]

Group 2: Key Findings
- The research team visualized the statistical properties of the KV cache and found that multimodal inputs exhibit higher semantic redundancy, indicating a larger compressible space [8][10]
- Redundancy levels differ significantly across heads within the same model, indicating a non-uniform distribution of redundancy [10][12]

Group 3: MixKV Solution
- MixKV aims to retain KV entries that are both important and diverse, reducing the risk of losing semantic coverage to redundancy [13][23]
- The method consists of two scoring steps (importance and diversity) plus a head-wise mixing step that adaptively balances the two factors according to each head's redundancy level [14][15][16]

Group 4: Experimental Results
- MixKV delivered consistent performance improvements across benchmarks in multimodal understanding, long-context reasoning, and GUI localization tasks [25][29][37]
- The method showed significant efficiency gains, reducing inference latency and peak memory usage under extreme compression [41][42]

Group 5: Conclusion
- MixKV represents a critical upgrade for KV cache compression in long-context reasoning, underscoring the need to account for redundancy structure in the design paradigm for scalable deployment of VLMs and LLMs [42]
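The "importance plus diversity plus head-wise mixing" recipe described above can be sketched in plain Python. The concrete choices below are illustrative assumptions, not the paper's exact formulation: mean pairwise cosine similarity as the per-head redundancy measure, greedy max-min distance to already-selected keys as the diversity score, and a mixing weight derived directly from redundancy.

```python
import math

def _cos(a, b):
    """Cosine similarity of two vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def mixkv_select(keys, attn, budget):
    """Toy MixKV-style selection for ONE attention head.

    keys   : list of cached key vectors, one per token
    attn   : accumulated attention each position received (importance proxy)
    budget : number of KV entries to keep

    Returns sorted indices of kept entries. The scoring rules here are
    assumptions for illustration only.
    """
    T = len(keys)
    sim = [[_cos(keys[i], keys[j]) for j in range(T)] for i in range(T)]

    # Head-wise redundancy: mean pairwise cosine similarity (self excluded).
    redundancy = sum(sim[i][j] for i in range(T)
                     for j in range(T) if i != j) / (T * (T - 1))
    # Assumed mixing rule: a more redundant head weights diversity more.
    alpha = min(max(1.0 - redundancy, 0.0), 1.0)  # weight on importance

    mx = max(attn)
    imp = [a / (mx + 1e-8) for a in attn]

    selected = [imp.index(max(imp))]  # seed with the most important token
    while len(selected) < min(budget, T):
        best, best_score = None, float("-inf")
        for i in range(T):
            if i in selected:
                continue
            # Diversity: distance from the nearest already-selected key.
            div = 1.0 - max(sim[i][j] for j in selected)
            score = alpha * imp[i] + (1.0 - alpha) * div
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)
```

Under this sketch, a head whose keys are nearly collinear (high redundancy) will pick entries mostly for coverage, while a low-redundancy head falls back to pure importance ranking, which is the adaptive behavior the summary attributes to head-wise mixing.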
Lossless Math Reasoning with Just 10% of the KV Cache! This Open-Source Method Tackles Reasoning LLMs' "Memory Overload" Problem
量子位· 2025-06-16 04:50
Core Viewpoint
- R-KV is a highly efficient compression method that turns the "rambling" of large reasoning models into controllable memory entries, reducing memory usage by 90%, increasing throughput by 6.6x, and maintaining 100% accuracy [1][2]

Group 1: R-KV Methodology
- R-KV applies a three-step process during model decoding: redundancy identification, importance assessment, and dynamic eviction of key/value (KV) tokens [5]
- The method compresses the KV cache in real time, retaining only important and non-redundant tokens and thereby addressing redundancy during inference [7][9]

Group 2: Performance Metrics
- On challenging mathematical benchmarks, R-KV significantly outperformed baseline methods and even the full-KV setup, reaching 34% accuracy with R1-Llama-8B and 54% with R1-Qwen-14B on MATH-500 [19]
- R-KV achieved up to 90% memory savings while sustaining high throughput (up to 2525.75 tokens per second), with notable gains in batch size and overall task performance [20][21]

Group 3: Visual Comparison
- A visual comparison between R-KV and SnapKV shows that R-KV retains critical context and effectively reduces noise, leading to better task completion [12][15]
- R-KV's token selection spans the entire reasoning process, preserving essential keys and values, whereas SnapKV tends to focus on local segments and may retain redundant information [14]

Group 4: Application Scenarios
- R-KV suits edge devices that require long-chain reasoning, enabling even consumer-grade GPUs and mobile NPUs to run complex models [22]
- The method can also accelerate reinforcement-learning sampling and is training-free and plug-and-play [22]
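The three steps attributed to R-KV (redundancy identification, importance assessment, dynamic eviction) can be illustrated with a minimal pure-Python pass over one head's cache. The concrete scores are assumptions for illustration: cosine similarity against a threshold for redundancy, and accumulated attention as the importance signal; the actual method's formulation may differ.

```python
import math

def _cos(a, b):
    """Cosine similarity of two vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def rkv_evict(keys, attn, budget, sim_thresh=0.9):
    """Toy R-KV-style keep/evict pass for one attention head.

    keys       : list of cached key vectors, one per token
    attn       : accumulated attention per position (importance proxy)
    budget     : maximum number of KV entries to keep
    sim_thresh : assumed redundancy threshold on cosine similarity

    Returns sorted indices of entries to KEEP.
    """
    keep = []
    # Step 2 first: visit tokens from most to least important.
    order = sorted(range(len(keys)), key=lambda i: -attn[i])
    for i in order:
        # Step 1: drop a token that is a near-duplicate (high cosine
        # similarity) of an already-kept, more important token.
        if any(_cos(keys[i], keys[j]) > sim_thresh for j in keep):
            continue
        keep.append(i)
        # Step 3: dynamic eviction stops once the budget is filled.
        if len(keep) == budget:
            break
    return sorted(keep)
```

Run at each decoding step (or every few steps), a pass like this keeps the cache at a fixed budget while preferring tokens that are both important and non-redundant, which is the behavior the summary describes.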