Truncated Matrix Entropy
Compressing hidden states across layers accelerates TTFT and shrinks the KV cache at the same time!
机器之心 · 2025-11-13 04:12
Core Insights
- The paper "UNComp: Can Matrix Entropy Uncover Sparsity?" tackles a paradox of matrix entropy in deep models: conventional matrix entropy rises with depth, which contradicts the sparsity observed in deeper layers [5][7].
- The breakthrough is Truncated Matrix Entropy, a measure that decreases as layers deepen. It explains the sparsity phenomenon and provides a theoretical basis for compression strategies (a hedged sketch of one possible formulation appears after this summary) [7][12].

Theoretical Framework
- The new theoretical tool offers a deeper view into a model's internal workings, focusing on information-flow patterns rather than merely optimizing attention distributions [8][12].
- Key structural insights link fluctuations in intermediate-layer entropy to retrieval layers and retrieval heads, enabling structured pruning under theoretical guidance [13].

Practical Applications
- The UNCOMP framework optimizes both computation and memory: it compresses hidden states during the prefill phase and the KV cache during decoding, applying compression layer-wise and head-wise (illustrative sketches follow this summary) [16][17].
- Experiments report a 60% speedup in the prefill phase, a 6.4x throughput gain, and a KV cache compressed to 4.74% of its original size [19].

Performance Metrics
- The framework retains model performance even at extreme compression rates: among the evaluated methods, Ours-group keeps 98.42% of baseline performance on Llama2 and 84.13% on Llama3 [20].
- Merging retrieval layers with final layers incurs minimal performance loss, and on some tasks the compressed model even surpasses the full-size baseline [21].

Conclusion
- UNCOMP is not only a compression tool but also a window into the complex information-compression behavior inside large language models [22].
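To make the core measure concrete, below is a minimal sketch of matrix entropy and a truncated variant. It assumes the common spectral definition over the hidden-state covariance and treats truncation as keeping the top-k eigenvalues of the normalized spectrum; the function names and the exact truncation rule are our assumptions, and the paper's formulation may differ in details.

```python
import numpy as np

def matrix_entropy(hidden: np.ndarray) -> float:
    """Von Neumann-style entropy of the hidden-state covariance.

    hidden: (num_tokens, hidden_dim) activations from one layer.
    """
    x = hidden - hidden.mean(axis=0, keepdims=True)
    cov = x.T @ x
    cov /= np.trace(cov)            # normalize so eigenvalues sum to 1
    eig = np.linalg.eigvalsh(cov)
    eig = eig[eig > 1e-12]          # drop numerical zeros
    return float(-(eig * np.log(eig)).sum())

def truncated_matrix_entropy(hidden: np.ndarray, k: int) -> float:
    """Entropy of the top-k eigenvalues only, renormalized (assumed variant)."""
    x = hidden - hidden.mean(axis=0, keepdims=True)
    cov = x.T @ x
    cov /= np.trace(cov)
    eig = np.linalg.eigvalsh(cov)[::-1][:k]   # keep the k largest eigenvalues
    eig = eig / eig.sum()                     # renormalize truncated spectrum
    eig = eig[eig > 1e-12]
    return float(-(eig * np.log(eig)).sum())
```

Applied layer by layer, the full entropy can rise with depth while the truncated value falls, which is the trend the paper uses to explain sparsity in deeper layers.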
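The prefill-phase claim (compressing hidden states across layers to cut TTFT) can be illustrated with token pruning: after a chosen layer, keep only the tokens that receive the most attention, so subsequent layers run over a shorter sequence. This is a sketch of the general pattern, not UNComp's published algorithm; `prune_tokens` and `keep_ratio` are hypothetical names, and the layer choice in UNComp follows its entropy analysis, which we do not reproduce here.

```python
import numpy as np

def prune_tokens(hidden: np.ndarray, attn: np.ndarray, keep_ratio: float):
    """hidden: (seq, dim); attn: (heads, seq, seq) attention weights.

    Returns pruned hidden states and the kept indices in original order.
    """
    scores = attn.mean(axis=0).sum(axis=0)   # attention each token receives
    k = max(1, int(keep_ratio * hidden.shape[0]))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k tokens, original order
    return hidden[keep], keep
```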
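For the decode phase, head-wise KV-cache compression can be illustrated by assigning each attention head a cache budget from its entropy score, on the reading that lower truncated entropy means a sparser, more compressible head. The proportional-with-floor rule and the name `allocate_head_budgets` are illustrative assumptions rather than the paper's allocation scheme.

```python
import numpy as np

def allocate_head_budgets(head_entropy: np.ndarray,
                          total_budget: int,
                          min_per_head: int = 8) -> np.ndarray:
    """Split a total KV-cache budget across heads in proportion to entropy.

    head_entropy: (num_heads,) per-head (truncated) matrix entropy.
    total_budget: total number of KV entries to keep across all heads.
    """
    num_heads = head_entropy.shape[0]
    floor = min_per_head * num_heads
    assert total_budget >= floor, "budget too small for the per-head floor"
    weights = head_entropy / head_entropy.sum()
    extra = np.floor(weights * (total_budget - floor)).astype(int)
    return min_per_head + extra               # low-entropy heads keep less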