Self-attention
 Hands-On Large Models: KV Cache Principles and Code Walkthrough
 自动驾驶之心· 2025-10-20 06:30
Core Insights
- The article discusses the importance of the KV Cache in improving the efficiency of large language models (LLMs) during autoregressive inference, particularly within the Transformer architecture [1][20].

Group 1: Need for KV Cache
- The KV Cache stores intermediate computation results, which significantly improves the model's efficiency during text generation [1][20].
- In standard Transformer decoding, generating each new token requires attention over all previous tokens, leading to high computational cost [2][6].

Group 2: Working Principle of KV Cache
- The core idea is to cache the historical Key (K) and Value (V) matrices, avoiding redundant computation and reducing the per-step attention cost from O(n²) to O(n) [4][7].
- At each step, only the new token's Query (Q) is computed and attended against the cached K and V matrices, enabling efficient token generation [4][10].

Group 3: Technical Details of KV Cache
- The KV Cache typically maintains an independent cache for each attention head, and the cache grows dynamically until it reaches the model's maximum sequence length [11].
- While the KV Cache improves speed, it requires additional memory; the article cites roughly 20 KB per token for GPT-3-scale models, which adds up quickly under batched serving [12].

Group 4: Optimization Strategies for KV Cache
- Techniques such as paged KV Cache, dynamic cache management, quantization, and selective caching are used to improve efficiency while keeping memory usage under control [22][18].

Group 5: Code Implementation
- The article provides a PyTorch code example showing how to add KV caching to a self-attention module, highlighting the modifications required; a hedged sketch of this pattern follows below [14][17].

Group 6: Conclusion
- Understanding how the KV Cache works is essential for optimizing inference performance in large models and for addressing practical deployment challenges [20].
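As a concrete illustration of the caching pattern summarized above, here is a minimal sketch of single-head self-attention with a KV cache in PyTorch. It is not the article's exact code: the class name CachedSelfAttention, the single-head layout, the tensor shapes, and the concatenation-based cache update are illustrative assumptions, and causal masking for multi-token prefill is omitted for brevity.

```python
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    """Single-head self-attention with an optional KV cache (illustrative sketch).

    During decoding the input usually contains only the newest token, so only
    its K/V are computed; keys/values for the prefix come from the cache.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x, kv_cache=None):
        # x: (batch, seq_new, d_model); seq_new is typically 1 during decoding.
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        if kv_cache is not None:
            past_k, past_v = kv_cache
            # Reuse cached keys/values instead of recomputing them for the prefix.
            k = torch.cat([past_k, k], dim=1)
            v = torch.cat([past_v, v], dim=1)

        # Attend the new queries against all (cached + new) keys and values.
        # Note: a causal mask would be needed here if seq_new > 1 (prefill).
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = self.out_proj(attn @ v)

        # Return the updated cache so the caller can pass it to the next step.
        return out, (k, v)

# Usage sketch: the cache grows by one position per generated token,
# so each step does O(n) attention work instead of recomputing O(n²).
layer = CachedSelfAttention(d_model=64)
cache = None
token = torch.randn(1, 1, 64)
for _ in range(5):
    out, cache = layer(token, cache)
    token = out[:, -1:, :]  # feed the newest position back in
```

The key design point matches the summary: Q is computed only for the new token, while K and V for earlier tokens are read from the cache rather than recomputed, at the cost of memory that grows linearly with sequence length.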
 Zhejiang University Proposes Translution: Unifying Self-Attention and Convolution, Bringing a New Round of Performance Breakthroughs to ViT and GPT Architectures
 AI科技大本营· 2025-10-14 08:17
Core Insights
- The article introduces Translution, a new deep neural network operation that combines the adaptive modeling strength of Self-Attention with the relative-position modeling of Convolution, unifying the two so that networks capture representations tied to the data's intrinsic structure rather than to absolute positions [1][5].

Group 1: Performance Improvements
- Experimental results indicate that networks built on Translution improve performance in both ViT and GPT architectures, suggesting broad applicability [3].
- On natural language modeling tasks, Translution-based models outperform their Self-Attention counterparts [4].

Group 2: Technical Details
- The core idea is to replace the fixed weight kernel of convolution with a dynamic, adaptive kernel generated by the self-attention mechanism, addressing limitations of current Transformer models [5]; a speculative sketch of this idea appears after this summary.
- In the reported experiments, Translution achieves lower perplexity than standard Self-Attention across the tested architectures, indicating improved effectiveness [4].

Group 3: Industry Implications
- As demand for larger models grows, simply scaling parameters and training data is hitting diminishing returns, motivating new operator designs such as Translution to sustain progress in deep learning [5].
- However, Translution's added capability comes with higher computational requirements, particularly GPU memory, which may widen existing disparities in access to AI compute [6].
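To make the "dynamic adaptive kernel" idea in Group 2 concrete, below is a speculative, minimal sketch; it is not the paper's actual Translution operator. It assumes the simplest possible reading: each relative offset in a local window gets its own key/value projection, and attention scores over those offsets form an input-dependent, convolution-like kernel per query position. The class name RelativeDynamicKernel, the fixed window size, and all shapes are illustrative assumptions (the real operator may act over all relative offsets rather than a fixed window, and GPT-style use would additionally need causal masking).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeDynamicKernel(nn.Module):
    """Speculative sketch of a 'dynamic kernel from attention' operator.

    Each relative offset r in a local window has its own key/value projection,
    so the attention weights a query assigns to its neighbours behave like a
    per-position, input-dependent convolution kernel.
    """

    def __init__(self, d_model: int, window: int = 3):
        super().__init__()
        assert window % 2 == 1, "use an odd window so offsets are symmetric"
        self.window = window
        self.q_proj = nn.Linear(d_model, d_model)
        # One projection per relative offset: the assumed mechanism tying
        # relative position into the dynamically generated kernel.
        self.k_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(window)])
        self.v_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(window)])
        self.scale = d_model ** -0.5

    def forward(self, x):
        # x: (batch, seq, d_model)
        B, S, D = x.shape
        q = self.q_proj(x)                                  # (B, S, D)
        half = self.window // 2
        # Pad the sequence dimension so every position has a full neighbourhood.
        x_pad = F.pad(x, (0, 0, half, half))                # (B, S + window - 1, D)

        scores, values = [], []
        for r in range(self.window):
            neigh = x_pad[:, r:r + S, :]                    # neighbour at offset r - half
            k_r = self.k_proj[r](neigh)                     # offset-specific key
            v_r = self.v_proj[r](neigh)                     # offset-specific value
            scores.append((q * k_r).sum(-1, keepdim=True) * self.scale)
            values.append(v_r)

        # Softmax over the window: a dynamic, input-dependent kernel per query.
        kernel = torch.softmax(torch.cat(scores, dim=-1), dim=-1)   # (B, S, window)
        stacked = torch.stack(values, dim=-2)                       # (B, S, window, D)
        return (kernel.unsqueeze(-1) * stacked).sum(dim=-2)         # (B, S, D)
```

The sketch also hints at the memory cost noted in Group 3: keeping separate projections per relative offset multiplies parameter and activation memory compared with plain self-attention.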