Grouped Latent Attention (GLA)

New work from Mamba's core author: a replacement for the attention mechanism DeepSeek uses, built specifically for inference
猿大侠· 2025-06-02 04:22
Core Insights
- The article discusses a new research paper from Tri Dao and his team at Princeton University, which introduces two attention mechanisms designed specifically for inference, significantly improving decoding speed and throughput while maintaining model quality [1][2][5].

Group 1: Research Contributions
- The paper presents two main contributions: Grouped-Tied Attention (GTA) and Grouped Latent Attention (GLA). GTA reduces KV cache usage by approximately 50% compared with the GQA mechanism integrated into LLaMA 3, while GLA decodes faster than the MLA mechanism used by DeepSeek, achieving up to 2x speedups in certain scenarios [2][11][22].
- GTA is described as an effective alternative to GQA, and GLA as a practical substitute for MLA, maintaining comparable model quality while reducing memory usage and improving computational efficiency [3][12].

Group 2: Mechanism Design
- GTA ties and reuses the key and value states across different query heads, reducing how often data must be moved through memory. It groups multiple heads to share the same KV parameters, in contrast with traditional multi-head attention, which stores independent keys and values for every head [15][16] (see the first sketch below).
- GLA improves hardware efficiency by raising arithmetic intensity, the amount of computation performed per byte loaded from memory, thereby reducing dependence on memory bandwidth while preserving the parallel scalability needed for fast decoding [18][19].

Group 3: Experimental Results
- The team trained models of several sizes (small, medium, large, and XL) on the FineWeb-Edu-100B dataset and found that GTA outperforms GQA on medium and large models, suggesting it scales well as models grow [22][23].
- Both GTA and GLA maintain or improve downstream task performance as model size increases, supporting their use as alternatives to existing mechanisms [25][37].

Group 4: Performance Metrics
- Evaluation covered perplexity and downstream-task accuracy, along with efficiency indicators such as decoding latency, throughput, and KV cache usage. GTA reduced KV cache usage by about 50% relative to GQA without sacrificing model quality [27][28] (a rough byte count appears in the second sketch below).
- GLA delivered higher throughput in real-time serving benchmarks, especially under concurrent requests, indicating efficient handling of long contexts and variable request lengths [31][34].
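To make the grouping-and-tying idea concrete, here is a minimal NumPy sketch of a single decode step in which several query heads share one cached state that is reused as both keys and values. This illustrates only the sharing pattern described above, not the paper's exact formulation; all names and sizes (num_q_heads, num_kv_groups, head_dim, seq_len) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes, for illustration only.
num_q_heads   = 8      # query heads
num_kv_groups = 2      # each group of 4 query heads shares one cached state
head_dim      = 64
seq_len       = 128    # tokens already in the cache

rng = np.random.default_rng(0)

# GQA keeps separate K and V caches per group. A GTA-style cache instead
# ties K and V into a single shared state per group, roughly halving what
# must be stored and streamed from memory at each decode step.
tied_kv = rng.standard_normal((num_kv_groups, seq_len, head_dim))

# One new query vector per query head (a single decode step).
q = rng.standard_normal((num_q_heads, head_dim))

heads_per_group = num_q_heads // num_kv_groups
out = np.empty_like(q)
for h in range(num_q_heads):
    g = h // heads_per_group          # which shared cache this head reads
    k = tied_kv[g]                    # tied state reused as keys...
    v = tied_kv[g]                    # ...and as values
    scores = softmax(q[h] @ k.T / np.sqrt(head_dim))
    out[h] = scores @ v

print(out.shape)  # (8, 64): one output vector per query head
```

Because all heads in a group read the same cached tensor, the cache is loaded from memory once per group rather than once per head, which is the memory-transfer saving the article attributes to GTA.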
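A back-of-the-envelope view of the roughly 50% KV-cache claim: GQA stores two tensors per group (keys and values), while a tied cache stores one shared state per group. The concrete numbers below (head_dim, head counts, fp16) are assumptions chosen only to make the arithmetic visible, not figures from the paper.

```python
# Per-token KV-cache bytes per layer: entries * head_dim * bytes_per_element.
head_dim, bytes_fp16 = 64, 2
num_heads, num_kv_groups = 32, 8

mha_entries = 2 * num_heads        # separate K and V for every head
gqa_entries = 2 * num_kv_groups    # separate K and V per group
gta_entries = 1 * num_kv_groups    # one tied KV state per group (~50% of GQA)

for name, entries in [("MHA", mha_entries), ("GQA", gqa_entries), ("GTA-like", gta_entries)]:
    print(f"{name:8s}: {entries * head_dim * bytes_fp16} bytes per token per layer")
```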