Must-read! Sebastian Raschka's new blog surveys all the major attention mechanisms
机器之心· 2026-03-23 07:10
Core Insights
- The article discusses the recent release of an "LLM Architecture Gallery" by AI author Sebastian Raschka, which has attracted significant attention in the AI community [1][4]
- Raschka has also published a blog post titled "A Visual Guide to Attention Variants in Modern LLMs," intended to serve as both a reference and a lightweight learning resource [6][7]

Summary by Sections
1. Multi-Head Attention (MHA)
- Attention lets each token look at the other visible tokens in the sequence and assign them weights, building a context-aware representation of the input [7][8]
- MHA is the standard Transformer version of this idea, running multiple self-attention heads in parallel, each with different learned projections [8][11]
- The limitations of older RNN systems in handling long sequences motivated attention mechanisms, which give each position direct access to the relevant input tokens [12][13][21]

2. Grouped Query Attention (GQA)
- GQA is a variant of standard MHA in which several query heads share the same key-value projections, reducing memory costs [35][36]
- GQA has become a popular alternative to classic MHA: it lowers the parameter count and KV-cache traffic while remaining easy to implement [40][41]
- GQA is particularly useful as sequence lengths grow, where its KV-cache memory savings become significant [42][44]

3. Multi-Head Latent Attention (MLA)
- MLA reduces KV-cache memory by compressing what is stored, rather than by sharing heads [51][54]
- MLA has been reported to retain better modeling quality than GQA, especially in larger models [58][60]
- MLA is more complex to implement, but it becomes attractive as model size and context length increase [55][64]

4. Sliding Window Attention (SWA)
- SWA reduces memory and compute by limiting how many previous tokens each position can attend to, focusing on a fixed window of recent tokens [66][71]
- Many architectures interleave local windowed layers with occasional global attention layers so that information can still propagate across the full sequence [71][75]

5. DeepSeek Sparse Attention (DSA)
- DSA, introduced in DeepSeek V3.2, lets each token attend to a subset of previous tokens, selected by a learned sparse pattern rather than a fixed local window [76][78]
- DSA is used alongside MLA, optimizing both the cached representation and the attention pattern [88][90]

6. Gated Attention
- Gated attention is a modified full-attention block that appears in mixed stacks; it improves stability and control without replacing the attention mechanism itself [91][94]
- Typical modifications include an output gate and a zero-centered QK-Norm variant to make training behavior more predictable [98][100]

7. Mixed Attention Architectures
- Mixed attention architectures keep a Transformer-like stack but replace most of the expensive full-attention layers with cheaper linear-attention or state-space modules [105][108]
- These architectures target long-context efficiency while balancing simplicity and modeling performance [129][130]
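The standard MHA described in section 1 can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not Raschka's code; all names and dimensions are chosen for the example, and it shows the core steps: per-head projections, scaled dot-product scores, a causal mask, and head concatenation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Causal multi-head self-attention over one sequence.

    x: (seq_len, d_model); each W*: (d_model, d_model).
    """
    T, d = x.shape
    hd = d // n_heads  # per-head dimension
    # Project, then split into heads: (n_heads, T, hd)
    q = (x @ Wq).reshape(T, n_heads, hd).transpose(1, 0, 2)
    k = (x @ Wk).reshape(T, n_heads, hd).transpose(1, 0, 2)
    v = (x @ Wv).reshape(T, n_heads, hd).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)      # (n_heads, T, T)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)   # True = future token, masked out
    scores = np.where(causal, -np.inf, scores)
    out = softmax(scores) @ v                            # (n_heads, T, hd)
    return out.transpose(1, 0, 2).reshape(T, d) @ Wo     # concatenate heads, mix

rng = np.random.default_rng(0)
d, T, H = 16, 6, 4
x = rng.normal(size=(T, d))
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, *Ws, n_heads=H)
print(y.shape)  # (6, 16)
```

Each head sees the same tokens but through its own learned projections, which is what lets the heads specialize.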
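The key-value sharing behind GQA (section 2) can be sketched by giving the model fewer KV heads than query heads and broadcasting each KV head to its group of queries. This is an illustrative sketch with made-up dimensions, not any particular model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """GQA: n_q_heads query heads share n_kv_heads key/value heads.

    Wq: (d_model, n_q_heads*hd); Wk, Wv: (d_model, n_kv_heads*hd).
    """
    T, _ = x.shape
    hd = Wq.shape[1] // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per KV head
    q = (x @ Wq).reshape(T, n_q_heads, hd).transpose(1, 0, 2)
    k = (x @ Wk).reshape(T, n_kv_heads, hd).transpose(1, 0, 2)
    v = (x @ Wv).reshape(T, n_kv_heads, hd).transpose(1, 0, 2)
    # Broadcast each KV head across its group of query heads
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    out = softmax(np.where(causal, -np.inf, scores)) @ v
    return out.transpose(1, 0, 2).reshape(T, -1)

rng = np.random.default_rng(1)
d, T, hd = 32, 5, 8
n_q, n_kv = 8, 2  # 4 query heads share each KV head
x = rng.normal(size=(T, d))
Wq = rng.normal(size=(d, n_q * hd)) * 0.1
Wk = rng.normal(size=(d, n_kv * hd)) * 0.1
Wv = rng.normal(size=(d, n_kv * hd)) * 0.1
y = grouped_query_attention(x, Wq, Wk, Wv, n_q, n_kv)
# KV cache per token shrinks from n_q*hd to n_kv*hd floats per tensor (4x here)
print(y.shape)  # (5, 64)
```

The memory saving comes only from the K/V side: the cache stores `n_kv_heads` heads per token instead of `n_q_heads`, which is exactly why the benefit grows with sequence length.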
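The compress-then-cache idea behind MLA (section 3) can be shown in a single-head sketch: cache a low-dimensional latent per token and re-expand it into keys and values when attending. DeepSeek's actual MLA has additional machinery (e.g., handling of positional encodings) that is omitted here; all dimensions and weight names below are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_kv_attention(x, Wq, W_down, W_uk, W_uv):
    """Single-head sketch of latent-compressed KV attention.

    Instead of caching full keys and values, cache the compressed
    latent c and decompress it into K and V when attending.
    """
    c = x @ W_down               # (T, d_latent): this is what the KV cache would store
    k, v = c @ W_uk, c @ W_uv    # decompress into keys and values
    q = x @ Wq
    T = x.shape[0]
    scores = q @ k.T / np.sqrt(k.shape[1])
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    return softmax(np.where(causal, -np.inf, scores)) @ v

rng = np.random.default_rng(2)
d, d_latent, hd, T = 32, 4, 16, 6
x = rng.normal(size=(T, d))
Wq = rng.normal(size=(d, hd)) * 0.1
W_down = rng.normal(size=(d, d_latent)) * 0.1
W_uk = rng.normal(size=(d_latent, hd)) * 0.1
W_uv = rng.normal(size=(d_latent, hd)) * 0.1
y = latent_kv_attention(x, Wq, W_down, W_uk, W_uv)
# Cache per token: d_latent floats instead of 2*hd -- 8x smaller in this sketch
print(y.shape)  # (6, 16)
```

The extra up-projections are the implementation cost the summary alludes to; the payoff is that only the small latent is stored per token.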
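The window restriction in SWA (section 4) reduces to a different attention mask: each position may see itself and at most `window - 1` preceding tokens. A small sketch of that mask, with an illustrative window size:

```python
import numpy as np

def sliding_window_mask(T, window):
    """Boolean mask, True where attention is NOT allowed:
    future tokens, or tokens more than window-1 positions back."""
    i = np.arange(T)[:, None]  # query position
    j = np.arange(T)[None, :]  # key position
    return (j > i) | (j <= i - window)

m = sliding_window_mask(6, 3)
# Each row allows at most 3 positions: the token itself and the 2 before it
print((~m).sum(axis=1))  # [1 2 3 3 3 3]
```

Because every position sees a fixed-size window, the score matrix that actually needs computing is O(T·window) rather than O(T²), which is where the memory and compute savings come from.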
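The "learned subset" idea behind DSA (section 5) can be illustrated with generic top-k sparse attention: each query keeps only its k highest-scoring visible keys. This is not DeepSeek's actual selection mechanism (which uses a separate learned indexer); it is a minimal stand-in showing how a per-query sparse pattern differs from a fixed window.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, top_k):
    """Each query attends only to its top_k highest-scoring visible keys."""
    T, hd = q.shape
    scores = q @ k.T / np.sqrt(hd)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)         # causal mask first
    kth = np.sort(scores, axis=1)[:, -top_k][:, None]  # row-wise top_k-th score
    scores = np.where(scores < kth, -np.inf, scores)   # drop everything below it
    return softmax(scores) @ v

rng = np.random.default_rng(3)
T, hd = 8, 16
q, k, v = (rng.normal(size=(T, hd)) for _ in range(3))
y = topk_sparse_attention(q, k, v, top_k=3)
print(y.shape)  # (8, 16)
```

Unlike a sliding window, the kept positions here depend on the content of each query, which is the property the summary highlights.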
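The output gate mentioned under gated attention (section 6) is commonly formulated as an elementwise sigmoid gate computed from the layer input and multiplied into the attention output. The sketch below shows that one piece in isolation; the gate weight `Wg` and all dimensions are illustrative, and real gated-attention blocks combine this with further changes such as QK-Norm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_output(x, attn_out, Wg):
    """Elementwise output gate computed from the layer input x.

    The gate lies in (0, 1), so it can smoothly scale down (or pass
    through) each channel of the attention output.
    """
    gate = sigmoid(x @ Wg)
    return gate * attn_out

rng = np.random.default_rng(4)
T, d = 5, 16
x = rng.normal(size=(T, d))
attn_out = rng.normal(size=(T, d))  # stand-in for an attention block's output
Wg = rng.normal(size=(d, d)) * 0.1
y = gated_output(x, attn_out, Wg)
print(y.shape)  # (5, 16)
```

Because the gate only rescales channels, it adds a stabilizing control knob without replacing the attention computation itself, which matches how the summary describes it.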