Sparse Attention
Sebastian Raschka's 2026 Predictions: Transformers Still Reign, but Diffusion Models Are Quietly Rising
36Kr · 2026-01-14 08:39
Core Insights
- The architecture competition for LLMs is entering a nuanced phase, shifting from merely increasing model parameters to mixed architectures and efficiency tuning [1][4]
- The Transformer architecture is expected to remain the cornerstone of the AI ecosystem for at least the next few years, though efficiency adjustments and hybrid strategies are anticipated [4]
- Hybrid architectures and linear attention mechanisms are becoming a focal point in the industry, with models like DeepSeek V3 and R1 showcasing significant efficiency improvements [5][8]

Group 1: Efficiency Wars
- The industry is increasingly focusing on hybrid architectures and efficiency improvements, as demonstrated by models like DeepSeek V3, which significantly reduces KV cache usage during inference [5]
- The MoE architecture allows models to maintain a large parameter count (671 billion) while activating only 37 billion parameters during inference, highlighting a trend toward efficiency without sacrificing capacity [5]
- Other models such as Qwen3-Next and Kimi Linear adopt mixed strategies to balance long-range dependencies and inference speed [8]

Group 2: Diffusion Language Models
- Diffusion language models (DLMs) are attractive because parallel generation lets them produce tokens quickly and cost-effectively, in contrast to the serial generation of autoregressive models [10][11]
- Despite these advantages, DLMs struggle to integrate tool calls within response chains because all tokens are generated simultaneously [11]
- Research indicates that DLMs may outperform autoregressive models when high-quality data is scarce, as they can benefit from multiple training epochs without overfitting [17][19]

Group 3: Super Data Learners
- A recent paper suggests that DLMs could be superior learners in data-scarce settings, outperforming autoregressive models when trained on limited data [17][19]
- The "crossover" phenomenon indicates that autoregressive models learn faster with ample data, while DLMs excel when data is restricted [19]
- Factors behind DLMs' advantage include modeling dependencies between arbitrary positions in the text, deeper training through iterative denoising, and inherent data augmentation from the noising process [21]
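The parallel-generation contrast can be made concrete with a toy denoising loop. This is an illustrative sketch of the general masked-diffusion decoding idea, not any specific DLM's algorithm; `logits_fn` is a hypothetical stand-in for a trained model:

```python
import numpy as np

MASK = -1  # sentinel for a not-yet-generated position

def diffusion_decode(logits_fn, length, steps=4):
    """Toy masked-diffusion decoding: start fully masked and, at each
    denoising step, commit the most confident masked positions in parallel.
    An autoregressive decoder would instead commit exactly one token per step."""
    tokens = np.full(length, MASK)
    per_step = -(-length // steps)  # ceil division
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = logits_fn(tokens)          # shape (length, vocab)
        conf = logits.max(axis=1)
        # Pick the highest-confidence masked positions and fill them all at once.
        pick = masked[np.argsort(conf[masked])[::-1][:per_step]]
        tokens[pick] = logits[pick].argmax(axis=1)
    return tokens
```

With `steps=4` and `length=8`, two positions are committed per pass, so the whole sequence is produced in four model calls instead of eight.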
Sebastian Raschka's 2026 Predictions: Transformers Still Reign, but Diffusion Models Are Quietly Rising
机器之心· 2026-01-14 07:18
Core Insights
- The article surveys the evolving landscape of large language models (LLMs) as of 2026, highlighting a shift from pure Transformer dominance toward efficiency and hybrid architectures [1][4][5]

Group 1: Transformer Architecture and Efficiency
- The Transformer architecture is expected to remain the foundation of the AI ecosystem for at least the next few years, supported by mature toolchains and optimization strategies [4]
- Recent developments point to hybrid architectures and efficiency improvements rather than a complete overhaul of existing models [5]
- The industry is increasingly focusing on mixed architectures and efficiency, as demonstrated by DeepSeek V3 and R1, which use mixture of experts (MoE) and multi-head latent attention (MLA) to reduce inference costs while maintaining large parameter counts [7]

Group 2: Linear and Sparse Attention Mechanisms
- The standard Transformer attention mechanism has O(N^2) complexity, so computational cost grows quadratically with context length [9]
- New models like Qwen3-Next and Kimi Linear adopt hybrid strategies that combine efficient linear layers with full attention layers to balance long-range dependencies and inference speed [14]

Group 3: Diffusion Language Models
- Diffusion language models (DLMs) are gaining attention for generating tokens quickly and cost-effectively through parallel generation, in contrast to the serial generation of autoregressive models [12]
- Despite these advantages, DLMs struggle to integrate tool calls within response chains because all tokens are generated simultaneously [15]
- Research indicates that DLMs may outperform autoregressive models when high-quality data is scarce, as they can benefit from multiple training epochs without overfitting [24][25]

Group 4: Data Scarcity and Learning Efficiency
- The "crossover" concept suggests that autoregressive models learn faster with ample data, while DLMs excel when data is limited, achieving significant benchmark accuracy from relatively small datasets [27]
- DLMs show that additional training epochs do not necessarily degrade downstream task performance, offering a potential way forward in an era of data scarcity [28]
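The quadratic cost of standard attention is easy to see by counting query-key pairs; a short illustration:

```python
# Standard attention scores every (query, key) pair, so the work grows with
# the square of the context length: doubling the context quadruples the cost.
def score_pairs(context_len: int) -> int:
    return context_len * context_len

for n in (1_024, 2_048, 4_096):
    print(f"{n:>5} tokens -> {score_pairs(n):>12,} pairs")
```

This is why the growth is quadratic rather than exponential, and why sub-quadratic (linear or sparse) attention variants matter so much at 64k+ contexts.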
Peking University's Yuan Jingyang: Sparse Attention Gives Models a 10× Speedup——Attention
36Kr · 2026-01-07 07:58
Core Insights
- The article discusses the Native Sparse Attention (NSA) mechanism, a significant advance in natural language processing (NLP) and deep learning model optimization that addresses the challenges of long-context models [1][4][5]

Group 1: NSA Mechanism and Innovations
- NSA aims to resolve the structural contradictions of attention mechanisms by reorganizing information flow within the architecture, allowing efficient processing of long contexts [5][6]
- The architecture features three parallel attention paths: a compression path for global context aggregation, a selection path for retaining key details, and a sliding-window path for local context modeling [8][14]
- NSA achieves significant speed improvements, with training forward passes up to 9 times faster than full attention in 64k-context scenarios, while maintaining performance on various benchmarks [6][12]

Group 2: Performance and Efficiency
- In a 27-billion-parameter model, NSA reduced KV memory usage to about one-tenth of the original and came close to the theoretical limit of 11.6× acceleration during decoding [6][12]
- NSA outperforms existing sparse attention methods, indicating that performance and efficiency can coexist in long-context models [7][12]

Group 3: Hardware Alignment
- NSA is designed to align with modern GPU architectures, maximizing Tensor Core utilization by loading KV blocks in a way that minimizes memory-access overhead [9][20]
- The design allows efficient data loading and processing, addressing the limitations of traditional dense attention mechanisms [20][30]

Group 4: Training Awareness
- NSA incorporates a training-aware design, allowing the model to learn sparse patterns during training rather than being forced into sparsity prematurely [21][22]
- The architecture ensures the model can learn both local and global context relationships, which is crucial for maintaining performance in long-context tasks [17][22]

Group 5: Future Implications
- The article emphasizes the importance of sparse architectures as GPU capabilities evolve, suggesting the industry may be pushed toward sparse solutions to optimize performance [24][28]
- NSA represents a foundational shift in how models can operate efficiently across their entire lifecycle, from pre-training through post-training, sustaining performance in complex tasks [32][33]
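The three parallel paths can be sketched for a single query. This is a toy numpy illustration of the branch structure only: the real NSA uses learned per-query gates (fixed at 1/3 here), causal masking, and fused GPU kernels.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Standard scaled dot-product attention for a single query vector."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def nsa_single_query(q, K, V, block=4, top_k=2, window=8):
    """Toy single-query NSA: combine compression, selection, and
    sliding-window branches with equal weights."""
    n, d = K.shape
    n_blocks = n // block
    # 1) Compression branch: mean-pool each KV block into one coarse token.
    Kc = K[:n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    Vc = V[:n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    out_cmp = attend(q, Kc, Vc)
    # 2) Selection branch: keep only the top-k blocks by compressed score,
    #    then attend to their original (uncompressed) tokens.
    scores = (q @ Kc.T) / np.sqrt(d)
    top = np.argsort(scores)[-top_k:]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
    out_sel = attend(q, K[idx], V[idx])
    # 3) Sliding-window branch: attend only to the most recent tokens.
    out_win = attend(q, K[-window:], V[-window:])
    return (out_cmp + out_sel + out_win) / 3.0
```

Note how the selection branch reuses the compression branch's pooled keys to decide which blocks matter, which is what keeps block scoring cheap.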
DeepSeek's Small Update Beats OpenAI and Catches Up with Gemini
36Kr · 2025-12-03 00:58
Core Insights
- DeepSeek has launched two new models, DeepSeek V3.2 and DeepSeek-V3.2-Speciale, designed to compete with leading models like GPT-5 and Gemini [1][5][20]

Model Performance
- DeepSeek V3.2 shows competitive results across benchmarks, scoring close to or above GPT-5 and Gemini in several tests [6][20]
- On AIME 2025, DeepSeek V3.2 scored 93.1 and DeepSeek-V3.2-Speciale 96.0; on HMMT Feb 2025, DeepSeek V3.2 scored 92.5 and DeepSeek-V3.2-Speciale 99.2 [6]
- Overall, DeepSeek-V3.2-Speciale is noted for competing effectively with Gemini 3 [20][27]

Technological Innovations
- DeepSeek has implemented DeepSeek Sparse Attention (DSA) in its models, enabling more efficient processing of longer texts by reducing computational complexity [9][13]
- The company has focused on strengthening post-training for open-source models, investing over 10% of total training compute to improve performance on challenging tasks [17][21]
- DeepSeek-V3.2-Speciale encourages longer reasoning without penalizing extended thought processes, enhancing its ability to tackle complex problems [18][20]

Cost Efficiency
- Despite higher token consumption than competitors, DeepSeek offers a more cost-effective solution, with a significant price advantage over models like Gemini [32][33]
- For example, a response using 8077 tokens on DeepSeek costs approximately $0.0032, while one using 4972 tokens on Gemini costs around $0.06, roughly a 20-fold price difference [33]

Industry Context
- The gap between open-source and closed-source models is reportedly widening, but DeepSeek is working to close it through innovative approaches and cost-saving measures [35][36]
- The company's strategy emphasizes algorithmic improvements over merely scaling computational power, in line with industry views on efficient model training [38][39]
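The rough "20-fold" figure checks out against the quoted numbers, and the per-token gap is even wider because DeepSeek spends more tokens on the same task:

```python
# Reproducing the article's price comparison. The dollar and token figures
# come from the article; the per-token breakdown is our own arithmetic.
deepseek_cost, deepseek_tokens = 0.0032, 8077   # USD and tokens for one response
gemini_cost, gemini_tokens = 0.06, 4972

per_response_ratio = gemini_cost / deepseek_cost                              # ~18.8x
per_token_ratio = (gemini_cost / gemini_tokens) / (deepseek_cost / deepseek_tokens)

print(f"per response: {per_response_ratio:.1f}x cheaper on DeepSeek")
print(f"per token:    {per_token_ratio:.1f}x cheaper on DeepSeek")
```

So even though DeepSeek emitted about 62% more tokens for this example, the total bill is still an order of magnitude lower.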
Attention Revisited: DeltaNet and New Improvements to Linear Attention, Used by Alibaba and Kimi | LatePost Podcast
晚点LatePost· 2025-12-02 09:13
Core Insights
- The article discusses advances in linear attention mechanisms, particularly DeltaNet, which aim to improve the efficiency and effectiveness of large language models (LLMs) by reducing the computational complexity of traditional attention [5][10][12]

Group 1: Linear Attention Mechanisms
- Linear attention mechanisms such as DeltaNet were introduced to address the bottleneck of traditional attention, whose complexity is quadratic in input length [5][12]
- DeltaNet's development has been a collaborative effort, with significant contributions since its inception in 2021 focused on improving the update rules and parallelization of linear attention [7][20][21]
- The recent open-source releases of Qwen3-Next and Kimi Linear by Alibaba and Kimi incorporate linear attention, signaling a shift toward these more efficient designs in flagship applications [5][24]

Group 2: DeltaNet and Its Evolution
- DeltaNet was initially overlooked due to missing architectural improvements and suboptimal implementations, but recent advances have driven its adoption in industry [20][24]
- The Gated DeltaNet variant improves memory control and retrieval performance, making it better suited to modern hardware [7][21][24]
- The relationship between DeltaNet and models such as Kimi Linear reflects the trend of combining linear attention with full attention to balance speed and capacity [24][25]

Group 3: Future Directions and Challenges
- The article emphasizes the need for further exploration of update rules in linear attention, suggesting that improvements there could yield better performance and scalability [48][49]
- Combining sparse attention with linear attention to tackle long-text processing is discussed as a promising but still significant open challenge [46][49]
- The ongoing industry debate over linear versus full attention reflects the trade-offs involved in model design for different applications [27][30]
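The update rule at the heart of DeltaNet is the classic delta rule applied to a matrix-valued memory. A minimal recurrent-form sketch (single head; the production kernels parallelize this across the sequence, and Gated DeltaNet adds a learned decay on `S`):

```python
import numpy as np

def deltanet_step(S, q, k, v, beta):
    """One recurrent step of the delta rule: instead of blindly adding
    v k^T (vanilla linear attention), first erase the value currently
    stored under key k, then write the new one.

    S: (d_v, d_k) memory matrix; q, k: (d_k,); v: (d_v,); beta in (0, 1]."""
    q = q / (np.linalg.norm(q) + 1e-8)      # L2-normalized query
    k = k / (np.linalg.norm(k) + 1e-8)      # L2-normalized key
    pred = S @ k                            # value currently retrieved by k
    S = S + beta * np.outer(v - pred, k)    # delta-rule correction
    return S, S @ q                         # new state, output for this step
```

The key property: writing a new value under an existing key overwrites the old association rather than accumulating on top of it, which is what improves retrieval over vanilla linear attention.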
DeepSeek's New Model Is Wild: The Whole AI Community Is Studying the Visual Route, and Karpathy Isn't Holding Back
机器之心· 2025-10-21 03:43
Core Insights
- The article covers the groundbreaking release of the DeepSeek-OCR model, which compresses 1000 words into 100 visual tokens while maintaining a high accuracy of 97% [1]
- The model addresses the long-context efficiency problem in large language models (LLMs) and suggests a paradigm shift in which visual inputs may be more effective than textual inputs [1][5]

Group 1: Model Features and Performance
- DeepSeek-OCR can process 200,000 pages of data daily on a single NVIDIA A100 GPU [1]
- Its compression is ten times more efficient than traditional text tokens, greatly reducing the number of tokens needed to represent the same information [9]
- The model eliminates the need for tokenizers, which have been criticized for their complexity and inefficiency [6]

Group 2: Community Reception and Expert Opinions
- The open-source release led to widespread validation and excitement in the AI community, with over 4000 GitHub stars shortly after launch [1][2]
- Experts like Andrej Karpathy have praised the model, highlighting its potential to redefine how LLMs process inputs [3][5]
- The model has sparked discussion about the efficiency of visual versus text tokens, with some researchers noting that visual representations may perform better in certain contexts [9][11]

Group 3: Implications for Future Research
- Visual tokens could significantly extend the effective context length of models, potentially allowing extensive internal documents to be folded into prompts [12][13]
- Earlier research laid the groundwork for similar ideas, so while DeepSeek-OCR is innovative, it is part of a broader trend in the field [18][20]
- Combining DeepSeek-OCR with other recent advances, such as sparse attention mechanisms, is highlighted as a promising direction for future exploration [11][12]
AI Special Report: DeepSeek's Sparse Attention Mechanism Unlocks Greater Growth Potential for the AI Industry
Zhongyuan Securities· 2025-10-16 11:46
Investment Rating
- The industry investment rating is "Outperform the Market," with an expected gain of over 10% relative to the CSI 300 index over the next six months [41]

Core Insights
- The report emphasizes that sparse attention mechanisms, particularly through DeepSeek, significantly enhance the development potential of the AI industry [8][37]
- DeepSeek's advances in attention mechanisms, including Native Sparse Attention (NSA) and DeepSeek Sparse Attention (DSA), are pivotal to improving model performance and efficiency [18][23][37]

Summary by Sections
1. Relationship Between the Attention Mechanism and Large Model Development
- The attention mechanism, introduced to improve information-processing efficiency, has become a core component of large models, addressing the limitations of traditional recurrent neural networks [11]
- Sparse attention reduces computational complexity from O(L²) to sub-quadratic levels, overcoming memory and compute bottlenecks [11]

2. DeepSeek's Technological Improvements to the Attention Mechanism
- DeepSeek has made significant contributions in three areas: Multi-head Latent Attention (MLA), Native Sparse Attention (NSA), and DeepSeek Sparse Attention (DSA) [12][18][23]
- MLA reduces memory usage by approximately 90% while maintaining model performance, significantly lowering training costs [16]
- NSA speeds up long-text processing by 11 times while matching the performance of traditional models [18]
- DSA improves training and inference efficiency, leading to substantial cost reductions for model usage [23]

3. DSA and NSA Unlock Greater Development Potential for the AI Industry
- Together, DSA and NSA enable longer model contexts and better computational efficiency, both crucial for the demands of multi-modal applications [33][37]
- The trend toward longer inputs and outputs demands innovative approaches to model training and performance enhancement [33]
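To make the ~90% MLA saving concrete, here is a back-of-the-envelope KV-cache calculation. The layer, head, and latent sizes below are illustrative assumptions, not DeepSeek's published configuration:

```python
# Back-of-the-envelope KV-cache sizing. All sizes here are illustrative
# assumptions, not DeepSeek's actual model configuration.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    # Standard attention caches one K and one V vector per head, per layer,
    # per token (bytes_per=2 assumes fp16/bf16 storage).
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

full = kv_cache_bytes(layers=60, kv_heads=128, head_dim=128, seq_len=128_000)

# MLA instead caches one compressed latent vector per token per layer
# (assume a 512-dim latent), from which K and V are re-derived on the fly:
latent_dim = 512
mla = 60 * latent_dim * 128_000 * 2

print(f"full KV cache:    {full / 1e9:.1f} GB")
print(f"MLA latent cache: {mla / 1e9:.1f} GB ({1 - mla / full:.1%} smaller)")
```

With these toy numbers the latent cache comes out even smaller than the ~90% saving the report cites; the exact figure depends on the real head counts and latent width.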
Second-Generation InfLLM Open-Sourced: 3× Faster at the Same Size, Zero Extra Parameters, Trainable Sparse Attention
36Kr · 2025-10-09 12:12
Core Insights
- InfLLM-V2 is an efficient sparse attention model designed to handle long texts with minimal data, achieving performance close to traditional dense models [1][2]
- The model switches seamlessly between short- and long-text processing modes, significantly improving efficiency and quality on long-context tasks [1][2]
- InfLLM-V2 delivers a fourfold speedup over dense attention while retaining 98.1% of dense performance on long-text understanding tasks and 99.7% on deep reasoning tasks [1][2]

Summary by Sections
Model Advantages
- Low-cost training: only 5 billion long-text tokens are needed to acquire sparse attention capability, reducing training costs and shortening adaptation cycles [2]
- The model switches from dense to sparse attention without adding parameters, aligning with mainstream training paradigms for stability and faster convergence [2]
- Efficient operator implementation removes the time bottleneck of sparse attention through hardware-friendly designs, significantly reducing HBM I/O and computational overhead [2]

Technical Mechanism
- InfLLM-V2 replaces the dense attention paradigm, in which each query interacts with all keys, with a sparse approach that limits interactions to a selected subset, reducing computational cost [3][4]
- The model uses a two-step process: block selection to identify relevant key-value subsets, followed by sparse attention computed only over the selected subsets [4][6]

Performance Evaluation
- On long-text understanding tasks, InfLLM-V2 matches dense attention models, while other sparse attention methods show performance degradation [9]
- On deep reasoning tasks, InfLLM-V2 also matches dense attention, whereas NSA-style methods hurt model effectiveness [11]
- Efficiency tests show InfLLM-V2 achieving 4-9× operator-level speedups over dense attention, with significant improvements in both the prefill and decode phases [13][17]

Future Developments
- The company plans to keep optimizing InfLLM-V2's training and inference operators and to integrate it into mainstream inference frameworks [20]
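The two-step procedure (block selection, then attention over the selected subset only) can be sketched for a single query. This is an illustrative toy, not InfLLM-V2's actual operator, which fuses these steps into hardware-friendly kernels:

```python
import numpy as np

def block_sparse_attention(q, K, V, block=8, top_k=2):
    """Toy two-step sparse attention for a single query:
    (1) score each KV block by its mean-pooled key and keep the top-k blocks;
    (2) run exact softmax attention only over the kept blocks."""
    n, d = K.shape
    n_blocks = n // block
    # Step 1: block selection via pooled keys (cheap: n_blocks dot products).
    pooled = K[:n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    block_scores = pooled @ q
    keep = np.sort(np.argsort(block_scores)[-top_k:])
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
    # Step 2: dense attention restricted to the selected subset.
    s = (K[idx] @ q) / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    # Return the output and the fraction of the KV cache actually read.
    return w @ V[idx], len(idx) / n
```

With 64 cached tokens, `block=8`, and `top_k=2`, only a quarter of the keys and values are ever touched, which is where the HBM I/O saving comes from.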
Trillion-Dollar OpenAI, Soaring Memory Prices, and the Freshly Released DeepSeek
傅里叶的猫· 2025-09-29 15:11
Group 1
- OpenAI is projected to become a trillion-dollar company, with significant investments in AI infrastructure and data centers [2][3][4]
- OpenAI plans to invest $1 trillion globally to build data centers to meet future demand for over 20 GW of computing power, with costs estimated at $500 billion per GW [4][5]
- OpenAI's CEO emphasizes the massive energy and infrastructure requirements of next-generation AI, comparing them to the power needs of more than 13 million American households [3][4]

Group 2
- Rising prices for memory components, particularly DDR, are squeezing server businesses and forcing pricing renegotiations with clients [6][10]
- Major manufacturers like Samsung and SK Hynix are cutting DDR4 production in favor of more profitable DDR5 and HBM memory, contributing to price increases [10]
- OpenAI's announcement of new U.S. AI data centers is expected to further drive memory demand, pushing up DDR5 and NAND Flash prices [10][14]

Group 3
- The DeepSeek V3.2-Exp model introduces sparse attention mechanisms to improve computational efficiency, enabling a 50% reduction in API service costs [22][28]
- Its performance remains comparable to previous versions, with some improvements on structured tasks but noted regressions in certain areas [29][34]
- Multiple kernel implementations for DeepSeek aim to optimize performance for different use cases, balancing speed and complexity [31][32]
Counterintuitive: The MoE Mixture-of-Experts Model Has Little to Do with Scenarios
理想TOP2· 2025-08-28 16:01
Core Viewpoint
- The MoE (Mixture of Experts) model is fundamentally a sparse computation mechanism aimed at improving efficiency, not a model in which each expert corresponds to a specific scenario [1][2]

Group 1: Scenario Limitations
- Having multiple MoE sub-experts does not mean each handles only a specific scenario; training separate models per scenario is impractical under the one-model paradigm [1]
- A model divided up by scenario would not be a true MoE structure [1]

Group 2: Uniform Distribution
- If only one type of scenario were run through dedicated experts, a large share of the model's parameters would sit idle, which is inefficient [2]
- It is more effective to distribute tasks evenly among experts than to assign specific experts to specific tasks, since rarely used experts may not justify their inclusion [2]

Group 3: Multiple Experts Activated
- An MoE model can activate several experts at once, distributing computation more evenly and handling more complex problems effectively [2]
- The essence of MoE is that only a small fraction of the parameters meaningfully influences any given output, making it a sparse model that improves computational efficiency [2]

Group 4: Understanding the Model
- Describing experts as suited to specific scenarios is a simplification that aids intuition, but it does not reflect the model's actual design intent [3]
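A few lines make the "only a small fraction of parameters fires per token" point concrete. This is a generic top-k router sketch, not any particular model's implementation; real MoE layers add load-balancing losses to keep expert usage uniform:

```python
import numpy as np

def moe_forward(x, W_gate, experts, top_k=2):
    """Toy MoE layer for one token: the router scores all experts, but only
    the top-k are actually run; their outputs are mixed by renormalized
    gate weights. This is why a model can hold hundreds of billions of
    parameters while activating only a small fraction per token."""
    logits = W_gate @ x
    top = np.argsort(logits)[-top_k:]          # indices of the active experts
    g = np.exp(logits[top] - logits[top].max())
    g = g / g.sum()                            # renormalized gate weights
    return sum(w * experts[i](x) for w, i in zip(g, top))
```

Note that routing is learned from the token representation, not assigned by scenario: which experts fire is decided per token by the gate, which is exactly the article's point.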