Sebastian Raschka's 2026 Predictions: Transformers Still Dominate, but Diffusion Models Are Quietly Rising
36Kr · 2026-01-14 08:39
Core Insights
- The architecture competition for LLMs is entering a nuanced phase, with a shift from merely increasing model parameters to a focus on mixed architectures and efficiency tuning [1][4]
- The Transformer architecture is expected to remain the cornerstone of the AI ecosystem for at least the next few years, although efficiency adjustments and hybrid strategies are anticipated [4]
- The rise of hybrid architectures and linear attention mechanisms is becoming a focal point for the industry, with models like DeepSeek V3 and R1 showcasing significant efficiency improvements [5][8]

Group 1: Efficiency Wars
- The industry is increasingly focusing on hybrid architectures and efficiency improvements, as demonstrated by models like DeepSeek V3, which significantly reduces KV cache usage during inference [5]
- The MoE architecture allows models to maintain a large total parameter count (671 billion) while activating only 37 billion parameters during inference, highlighting a trend toward efficiency without sacrificing capacity [5]
- Other models such as Qwen3-Next and Kimi Linear adopt mixed strategies to balance long-range dependencies and inference speed [8]

Group 2: Diffusion Language Models
- Diffusion language models (DLMs) are attractive because parallel generation lets them produce tokens quickly and cost-effectively, in contrast to the serial, token-by-token generation of autoregressive models [10][11]
- Despite these advantages, DLMs struggle to integrate tool calls into a response chain because all tokens are generated simultaneously [11]
- Research indicates that DLMs may outperform autoregressive models when high-quality data is scarce, as they can benefit from multiple training epochs without overfitting [17][19]

Group 3: Super Data Learners
- A recent paper suggests that DLMs could be superior learners in data-scarce settings, achieving better performance than autoregressive models when trained on limited data [17][19]
- The phenomenon known as "crossover" indicates that while autoregressive models learn faster with ample data, DLMs excel when data is restricted [19]
- Factors contributing to DLMs' advantage include their ability to model dependencies between arbitrary positions in the text, deeper training through iterative denoising, and inherent data augmentation from the noise process [21]
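The parallel-versus-serial contrast described above can be made concrete with a toy decoder. This is a minimal sketch, not a real diffusion LM: `toy_denoise_step`, the random "proposals", and the confidence threshold are all stand-ins for a learned denoising model, but the control flow shows why a DLM's cost scales with the number of denoising steps rather than with sequence length.

```python
import random

MASK = "<mask>"

def toy_denoise_step(seq, vocab, confidence=0.5):
    # One denoising step: propose a token for EVERY masked position in
    # parallel, but only commit the proposals we are "confident" about.
    out = list(seq)
    for i, tok in enumerate(out):
        if tok == MASK and random.random() < confidence:
            out[i] = random.choice(vocab)
    return out

def diffusion_decode(length, vocab, steps=4, seed=0):
    # Start fully masked; each step refines all positions at once, so
    # cost scales with `steps`, not with `length`.
    random.seed(seed)
    seq = [MASK] * length
    for _ in range(steps):
        seq = toy_denoise_step(seq, vocab)
    # Final pass: force-fill anything still masked.
    return [t if t != MASK else random.choice(vocab) for t in seq]

def autoregressive_decode(length, vocab, seed=0):
    # Serial baseline: one "model call" per token, left to right.
    random.seed(seed)
    return [random.choice(vocab) for _ in range(length)]

vocab = ["the", "cat", "sat", "on", "mat"]
print(diffusion_decode(16, vocab))       # 16 tokens from ~4 parallel passes
print(autoregressive_decode(16, vocab))  # 16 tokens from 16 serial steps
```

The same parallelism is also what makes interleaved tool calls awkward: there is no natural point mid-sequence at which to pause, call a tool, and condition the rest of the output on its result.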
Sebastian Raschka's 2026 Predictions: Transformers Still Dominate, but Diffusion Models Are Quietly Rising
机器之心· 2026-01-14 07:18
Core Insights
- The article discusses the evolving landscape of large language models (LLMs) as of 2026, highlighting a shift from sheer Transformer dominance to a focus on efficiency and hybrid architectures [1][4][5].

Group 1: Transformer Architecture and Efficiency
- The Transformer architecture is expected to remain the foundation of the AI ecosystem for at least the next few years, supported by mature toolchains and optimization strategies [4].
- Recent developments point to hybrid architectures and efficiency improvements rather than a complete overhaul of existing models [5].
- The industry is increasingly focusing on mixed architectures and efficiency, as demonstrated by models like DeepSeek V3 and R1, which use mixture of experts (MoE) and multi-head latent attention (MLA) to reduce inference costs while maintaining large parameter counts [7].

Group 2: Linear and Sparse Attention Mechanisms
- The standard Transformer attention mechanism has O(N^2) complexity, so computational cost grows quadratically with context length [9].
- New models like Qwen3-Next and Kimi Linear adopt hybrid strategies that combine efficient linear layers with full attention layers to balance long-range dependencies and inference speed [14].

Group 3: Diffusion Language Models
- Diffusion language models (DLMs) are gaining attention for their ability to generate tokens quickly and cost-effectively through parallel generation, in contrast to the serial generation of autoregressive models [12].
- Despite these advantages, DLMs struggle to integrate tool calls into a response chain because all tokens are generated simultaneously [15].
- Research indicates that DLMs may outperform autoregressive models when high-quality data is scarce, as they can benefit from multiple training epochs without overfitting [24][25].

Group 4: Data Scarcity and Learning Efficiency
- The concept of "crossover" suggests that while autoregressive models learn faster with ample data, DLMs excel when data is limited, achieving significant benchmark accuracy from relatively small datasets [27].
- DLMs show that increasing training epochs does not necessarily degrade downstream task performance, offering a potential path forward in an era of data scarcity [28].
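The gap between O(N^2) and linear attention comes down to one algebraic change. The NumPy sketch below is illustrative only (the ReLU-plus-epsilon feature map is a common textbook choice, not any specific model's kernel): once softmax is replaced by a feature map φ, the matrix product can be re-associated so the N×N attention matrix is never formed.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: materializes an N x N score matrix,
    # so cost is O(N^2 * d) in time and O(N^2) in memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Linear attention swaps softmax for a positive feature map phi,
    # so (phi(Q) @ phi(K).T) @ V re-associates to phi(Q) @ (phi(K).T @ V).
    # The d x d state phi(K).T @ V is independent of N: cost O(N * d^2).
    Qp, Kp = phi(Q), phi(K)
    state = Kp.T @ V              # (d, d) summary of all keys/values
    norm = Qp @ Kp.sum(axis=0)    # per-query normalizer
    return (Qp @ state) / norm[:, None]

rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = rng.normal(size=(3, N, d))
print(softmax_attention(Q, K, V).shape)  # (128, 16)
print(linear_attention(Q, K, V).shape)   # (128, 16)
```

The outputs differ numerically (the feature map is an approximation, not a softmax equivalent), which is precisely the quality-versus-cost trade-off the hybrid models above try to balance by keeping some full attention layers.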
Attention Revisited: DeltaNet, Used by Alibaba and Kimi, and New Improvements to Linear Attention | LatePost Podcast
晚点LatePost· 2025-12-02 09:13
Core Insights
- The article discusses advances in linear attention mechanisms, particularly DeltaNet, which aims to improve the efficiency and effectiveness of large language models (LLMs) by reducing the computational complexity of traditional attention [5][10][12].

Group 1: Linear Attention Mechanisms
- Linear attention mechanisms such as DeltaNet were introduced to address the computational bottleneck of traditional attention, whose cost grows quadratically with input length [5][12].
- DeltaNet's development has been a collaborative effort, with significant contributions from researchers since its introduction in 2021, focusing on improving the update rules and parallelization of linear attention [7][20][21].
- The recent open-source releases of the Qwen3-Next and Kimi Linear models by Alibaba and Kimi incorporate linear attention mechanisms, signaling a shift toward these more efficient designs in flagship applications [5][24].

Group 2: DeltaNet and Its Evolution
- DeltaNet was initially overlooked because it lacked key architectural refinements and had suboptimal implementations, but recent advances have driven its adoption in industry [20][24].
- The Gated DeltaNet variant improves memory control and retrieval performance and is better suited to modern hardware [7][21][24].
- The relationship between DeltaNet and models such as Kimi Linear illustrates the trend of combining linear attention with traditional full attention to balance speed and capacity [24][25].

Group 3: Future Directions and Challenges
- The article emphasizes the need for further exploration of update rules in linear attention mechanisms, suggesting that improvements here could yield better performance and scalability [48][49].
- Combining sparse attention with linear attention is discussed as a way to tackle long-text processing, which remains a significant hurdle for current models [46][49].
- The ongoing industry debate over linear versus full attention reflects the trade-offs involved in model design across applications [27][30].
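The "update rule" at the heart of DeltaNet is the classic delta rule applied to a fast-weight state. The recurrence below is a naive per-token loop for intuition only; production DeltaNet kernels use a parallelized chunk-wise form, and the constant β schedule here is a made-up stand-in for the model's learned per-token write strength.

```python
import numpy as np

def delta_rule_scan(Q, K, V, beta):
    # DeltaNet-style recurrence: the state S is a d_v x d_k fast-weight
    # matrix trained online by the delta rule. For each token t:
    #   pred = S @ k_t                         (what S currently recalls)
    #   S   += beta_t * (v_t - pred) @ k_t^T   (correct only the error)
    # Writing (v - S k) instead of plain v is what lets the model
    # overwrite stale associations rather than merely accumulate them.
    N, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.empty((N, d_v))
    for t in range(N):
        k, v, q = K[t], V[t], Q[t]
        err = v - S @ k                   # prediction error for this token
        S = S + beta[t] * np.outer(err, k)
        out[t] = S @ q                    # read the updated memory
    return out

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = rng.normal(size=(3, N, d))
beta = np.full(N, 0.5)                    # illustrative write strengths
print(delta_rule_scan(Q, K, V, beta).shape)  # (8, 4)
```

With β = 1 and a unit-norm key, a single write is recalled exactly; the parallelization work discussed in the episode is about computing this inherently sequential loop chunk-wise on GPUs.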
Kimi Open-Sources a New Linear Attention Architecture, Surpassing Full Attention Models for the First Time, with 6x Faster Inference
量子位· 2025-10-31 06:27
Core Insights
- The Transformer era is being redefined by the Kimi Linear architecture, which surpasses traditional full-attention models under identical training conditions [2][10].

Group 1: Kimi Linear Architecture
- Kimi Linear employs a novel attention mechanism that cuts KV cache requirements by 75% and achieves up to 6x faster inference on long-context tasks [4][26].
- The architecture introduces Kimi Delta Attention (KDA), which provides fine-grained control over memory retention, letting the model discard redundant information while preserving what matters [12][10].
- KDA's state update is based on an improved Delta Rule, remaining stable even over sequences of millions of tokens and avoiding gradient explosion or vanishing [13][14].

Group 2: Performance and Efficiency
- The model uses a 3:1 mixed layer design, stacking three linear attention layers for every full attention layer, balancing global semantic modeling against resource efficiency [15].
- Kimi Linear delivers superior results across benchmarks such as MMLU and BBH, outperforming traditional Transformers while maintaining accuracy on mathematical reasoning and code generation [22][26].
- Deployment is seamless with existing vLLM inference frameworks, making it easy to upgrade Transformer-based systems to Kimi Linear [21].

Group 3: Industry Trends
- Transformer dominance is being challenged, with alternatives like state space models (SSMs) showing potential for efficient computation and long-sequence modeling [28][30].
- Companies like Apple are exploring SSM architectures for their energy efficiency and lower latency, indicating a shift away from sole reliance on Transformers [30].
- The emergence of Kimi Linear signals a move toward diverse architectural innovation, a departure from the conventional Transformer path [32].
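The 75% KV cache saving follows directly from the 3:1 layer mix: only the full attention layers keep a per-token key/value cache, while linear attention layers carry a fixed-size state. A back-of-the-envelope sketch (the layer count, head count, and head dimension below are hypothetical round numbers, not Kimi Linear's published configuration):

```python
def kv_cache_bytes(n_full_attn_layers, seq_len, n_kv_heads,
                   head_dim, bytes_per_elem=2):
    # Per full-attention layer: one K and one V tensor per cached token,
    # each of shape (n_kv_heads, head_dim), at bytes_per_elem (fp16 = 2).
    per_layer = 2 * seq_len * n_kv_heads * head_dim * bytes_per_elem
    return n_full_attn_layers * per_layer

# Hypothetical 48-layer model at a 1M-token context.
full = kv_cache_bytes(48, 1_000_000, 8, 128)    # all 48 layers cache KV
hybrid = kv_cache_bytes(12, 1_000_000, 8, 128)  # 3:1 mix -> 12 full layers
print(f"full attention: {full / 2**30:.1f} GiB")
print(f"3:1 hybrid:     {hybrid / 2**30:.1f} GiB "
      f"({1 - hybrid / full:.0%} saved)")
```

Because decode-time attention is memory-bandwidth bound, shrinking the cache read per step by the same factor is also where much of the long-context throughput gain comes from.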
Just In: Kimi Open-Sources a New Architecture and Bets on Linear Attention
机器之心· 2025-10-31 04:11
Core Insights
- The article discusses advances in attention mechanisms, focusing on the Kimi Linear architecture, which combines linear attention and full attention to improve efficiency and performance across tasks [1][2][4].

Group 1: Kimi Linear Architecture
- Kimi Linear introduces a new hybrid linear attention design built on Kimi Delta Attention (KDA), which optimizes memory use in finite-state RNNs through a more efficient gating mechanism [4][10].
- The architecture interleaves KDA layers with periodic full attention layers at a 3:1 ratio, significantly reducing memory usage while matching or exceeding full-attention quality [10][32].
- Kimi Linear has 48 billion total parameters, of which 3 billion are activated per token, and handles context lengths of up to 1 million tokens [5][10].

Group 2: Performance and Efficiency
- Kimi Linear outperforms traditional full attention across a range of tasks, especially long-context ones, cutting the need for large key-value caches by up to 75% [5][10].
- The model achieves decoding throughput up to six times that of full multi-head attention models on long contexts [5][59].
- In comparative evaluations, Kimi Linear consistently beats baselines such as MLA and GDN-H on general knowledge, reasoning, and Chinese-language tasks [44][49].

Group 3: Technical Innovations
- The KDA mechanism adds fine-grained control over memory decay and position awareness, enhancing the model's expressiveness and efficiency [20][24].
- The architecture uses a chunk-wise recurrent, intra-chunk parallel strategy to maximize matrix multiplication throughput and make effective use of Tensor Cores [26][59].
- The NoPE (No Position Encoding) design enables efficient long-context training by delegating positional information to the KDA layers [34][39].

Group 4: Experimental Results
- Kimi Linear achieved the highest average scores on long-context benchmarks, demonstrating its effectiveness on extended sequences [52][53].
- In reinforcement learning settings, Kimi Linear improved faster and to a higher level than MLA, particularly on mathematical reasoning tasks [56][57].
- Efficiency stays high: latency overhead versus GDN-H during pre-filling is negligible, and the speed advantage widens as sequence length grows [59][60].
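The fine-grained decay that distinguishes KDA from a scalar-gated delta rule can be sketched as a per-channel forgetting factor applied before the corrective write. This is a single illustrative step in the spirit of the mechanism, not Kimi's actual KDA formulation (which adds chunk-wise parallelization, normalization, and the position-awareness role mentioned above); the decay vector `a` is invented for illustration.

```python
import numpy as np

def gated_delta_step(S, k, v, q, beta, a):
    # One step of a gated delta-rule update. `a` holds per-key-channel
    # decay factors in (0, 1]: each memory channel forgets at its own
    # rate (fine-grained decay) before the delta-rule corrective write.
    S = S * a[None, :]                 # channel-wise forgetting
    err = v - S @ k                    # what the state gets wrong about v
    S = S + beta * np.outer(err, k)    # delta-rule write: fix the error
    return S, S @ q                    # new state and this step's output

# With no decay (a = 1) and beta = 1, a single write to a unit-norm key
# is recalled exactly; lowering a channel's factor shrinks its memory.
d = 4
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([2.0, -1.0, 0.5, 3.0])
S, out = gated_delta_step(np.zeros((d, d)), k, v, q=k,
                          beta=1.0, a=np.ones(d))
print(out)  # recalls v
```

A scalar gate would force every channel to forget at the same rate; per-channel factors let some channels retain long-range information while others track fast-changing local context.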
3,700 Pretraining Runs in Search of the "Linear Attention" Non-Consensus: A MiniMax-01 Developer on 4 Years of Exploration
晚点LatePost· 2025-03-09 12:00
"我们跑的是下半场,赌的就是未来的长文本需求。" MiniMax 在今年 1 月发布了参数为 4560 亿的开源大模型 MiniMax-01,该模型就用到了他们开发的线 性注意力机制 "Lightning Attention"。 我们邀请了这个项目的负责人,MiniMax 高级研究总监钟怡然,来与我们一起聊线性注意力的研发过 程。钟怡然在 MiniMax 负责大模型网络架构设计,目前正开发多模态深度推理模型。 钟怡然曾担任上海人工智能实验室青年科学家,是新架构探索组的 PI(项目负责人);他在澳洲国立大 学获得博士学位,师从李宏东教授和 Richard Hartley 院士。他和他的团队已在一些国际顶级学术会议和 期刊上发表了 20 余篇关于模型新架构的论文,覆盖了当前多类非 Transformer 架构,如线性注意力机制 (线性注意力)、长卷积(Long Convolution)和线性循环网络(Linear RNN)。 在 2021 年,线性注意力还是一个 "看起来很美好的泡泡",怡然和团队就开始探索线性架构的实现。 嘉宾 丨 钟怡然 整理 丨 刘倩 程曼祺 上期播客中, 我们与清华的两位博士生,肖朝军和傅 ...