Key KV footprint

Cache Me If You Can: How Chen Danqi's Team "Catches" the Key Cache and Frees Up LLM Memory
机器之心· 2025-06-24 14:07
Core Viewpoint
- The research by Chen Danqi's team at Princeton University introduces a unified metric called the "KV footprint" to measure how efficiently language models use the key-value (KV) cache, particularly in long-context tasks, accounting for the memory consumed during both the pre-fill and decoding stages [10][12][15].

Group 1
- The emergence of workloads such as long chains of thought requires models to generate thousands of tokens [2].
- Most language models are based on the Transformer architecture, which stores the attention states of all previous tokens in a KV cache, so memory grows linearly with the input length [3][5].
- The KV cache is crucial for fast inference, but it can reach roughly 42 GB when processing long prompts, such as those with 128K tokens (a back-of-the-envelope estimate is sketched after this summary) [5].

Group 2
- Previous works have proposed evicting parts of the KV pairs from memory to achieve "sparse attention," but comparing these methods fairly has been difficult [6][20].
- The research defines the "key KV footprint" as the minimum KV footprint achievable while retaining at least 90% of the performance of full attention, which keeps comparisons meaningful [12][27].

Group 3
- The study finds that previous KV eviction methods suffer from high peak memory, particularly post-fill eviction methods, which are incompatible with pre-fill eviction [13].
- The team developed PruLong, an end-to-end optimization method that learns which attention heads need to retain the full KV cache and which do not, achieving a 12% reduction in KV footprint while maintaining performance on challenging recall tasks (a simplified sketch of the head-gating idea appears below) [15][36].

Group 4
- The research examines various efficient long-context methods and discusses how they fit within the KV-footprint framework, highlighting their trade-offs and differing notions of sparsity [28].
- The study categorizes KV entries as active, inactive, or evicted, and defines the KV footprint as the number of non-evicted attention entries aggregated across all time steps (a toy accounting of the metric is sketched below) [24][26].

Group 5
- PruLong optimizes the attention heads by minimizing the next-token prediction loss, which aligns better with how these models are actually used for text generation [37].
- The method trains on natural long-context data, in contrast with previous approaches that relied on synthetic data, which improves its applicability to real-world scenarios [39].
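To make the 42 GB figure concrete, here is a minimal back-of-the-envelope estimate of KV-cache size. The architectural numbers (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 storage) are assumptions chosen to resemble a 70B-class model; they are not stated in the article, so treat the result only as an order-of-magnitude check.

```python
# Rough KV-cache size estimate. The model shape below (80 layers, 8 KV heads via
# grouped-query attention, head dim 128, fp16) is an illustrative assumption,
# not a figure taken from the article.
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Each token stores one key and one value vector per layer and per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

print(f"~{kv_cache_bytes(128 * 1024) / 1e9:.1f} GB for a 128K-token prompt")  # ~42.9 GB
```

Under these assumptions the cache alone reaches tens of gigabytes, which is why policies that shrink the footprint matter so much for long contexts.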
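The footprint metric itself can be illustrated with a toy accounting: at every pre-fill or decoding step, each previous token's KV entry is either still in memory or has been evicted, and the footprint sums the non-evicted entries over all steps. The normalization against full attention below is my own choice for readability; the paper's exact formula (including how it treats active vs. inactive entries) may differ.

```python
import numpy as np

def kv_footprint(retained):
    """retained[t, i] is True if token i's KV entry is still in memory at step t.
    Counts non-evicted entries over all steps, normalized here by what full
    attention would keep (a simplifying assumption)."""
    T = retained.shape[0]
    full = T * (T + 1) // 2            # full attention keeps every prior entry at every step
    return retained.sum() / full

T = 4
full_attn = np.tril(np.ones((T, T), dtype=bool))        # keep everything
window = np.zeros((T, T), dtype=bool)
for t in range(T):
    window[t, max(0, t - 1):t + 1] = True                # sliding window of 2 entries
print(kv_footprint(full_attn))   # 1.0
print(kv_footprint(window))      # 0.7 -- evicting old entries shrinks the footprint
```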
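Finally, the head-level idea attributed to PruLong can be sketched in PyTorch: each attention head carries a learnable gate that interpolates between full causal attention and a short local window, and the gates are trained together with the usual next-token loss plus a sparsity penalty. The parameterization here (a plain sigmoid gate and an L1-style penalty) is a simplification assumed for illustration; the article does not spell out PruLong's actual gating or sparsity mechanism.

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Toy per-head gating: g -> 1 keeps the full KV cache, g -> 0 means the head
    only looks at a short local window, so its distant KVs could be evicted."""
    def __init__(self, d_model=64, n_heads=4, window=8):
        super().__init__()
        self.n_heads, self.d_head, self.window = n_heads, d_model // n_heads, window
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.gate_logits = nn.Parameter(torch.zeros(n_heads))     # one gate per head

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))                             # (B, H, T, d_head)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5      # (B, H, T, T)

        idx = torch.arange(T, device=x.device)
        causal = idx[None, :] <= idx[:, None]                      # standard causal mask
        local = causal & (idx[None, :] > idx[:, None] - self.window)

        full_out = scores.masked_fill(~causal, float("-inf")).softmax(-1) @ v
        local_out = scores.masked_fill(~local, float("-inf")).softmax(-1) @ v

        g = torch.sigmoid(self.gate_logits).view(1, -1, 1, 1)      # per-head gate in (0, 1)
        mixed = g * full_out + (1 - g) * local_out
        return self.out(mixed.transpose(1, 2).reshape(B, T, -1))

    def sparsity_penalty(self):
        # Pushes gates toward 0 so that as many heads as possible can drop distant KVs.
        return torch.sigmoid(self.gate_logits).sum()
```

In training, one would minimize the standard next-token cross-entropy on natural long-context text plus a weighted `sparsity_penalty()`, then round the learned gates to decide which heads keep the full cache at inference time; how PruLong actually discretizes its gates and enforces the sparsity budget is not described in this summary.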