Linear Attention
Revisiting Attention: DeltaNet and the New Improvements to Linear Attention Used by Alibaba and Kimi | LatePost Podcast
晚点LatePost· 2025-12-02 09:13
Core Insights
- The article discusses advances in linear attention mechanisms, particularly DeltaNet, which aim to make large language models (LLMs) more efficient by reducing the computational complexity of traditional attention [5][10][12].

Group 1: Linear Attention Mechanisms
- Linear attention mechanisms such as DeltaNet were introduced to address the computational bottleneck of traditional attention, whose cost grows quadratically with input length [5][12]. (A minimal sketch of the delta-rule update appears after this list.)
- DeltaNet has been developed collaboratively since 2021, with researchers contributing improvements to its update rule and to the parallelization of linear attention [7][20][21].
- The recent open-source releases of Qwen3-Next by Alibaba and Kimi Linear by Kimi both incorporate linear attention, signaling a shift toward these more efficient designs in flagship models [5][24].

Group 2: DeltaNet and Its Evolution
- DeltaNet was initially overlooked because it lacked key architectural refinements and had suboptimal implementations, but recent advances have driven its adoption in industry [20][24].
- The Gated DeltaNet variant improves memory control and retrieval performance and maps better onto modern hardware [7][21][24].
- The relationship between DeltaNet and models such as Kimi Linear illustrates the trend of combining linear attention with traditional full attention to balance speed and capacity [24][25].

Group 3: Future Directions and Challenges
- The article argues that the update rules of linear attention deserve further exploration, and that improvements there could yield better performance and scalability [48][49].
- Combining sparse attention with linear attention is discussed as a way to tackle long-text processing, which remains a significant hurdle for current models [46][49].
- The ongoing industry debate over linear versus full attention reflects the trade-offs involved in model design across applications [27][30].
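To make the delta-rule idea concrete, here is a minimal sketch (not the authors' implementation) of a single-head, recurrent DeltaNet-style update, assuming the commonly cited formulation S_t = S_{t-1} + beta_t (v_t - S_{t-1} k_t) k_t^T with a per-token write gate beta_t. Real systems parallelize this over chunks; the loop form is only for illustration.

```python
import torch

def delta_rule_attention(q, k, v, beta):
    """Recurrent delta-rule linear attention (single head).
    q, k: (T, d_k); v: (T, d_v); beta: (T,) per-token write strength in [0, 1]."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = torch.zeros(d_v, d_k)                 # fast-weight memory state
    outputs = []
    for t in range(q.shape[0]):
        v_old = S @ k[t]                       # value the memory currently stores for k[t]
        S = S + beta[t] * torch.outer(v[t] - v_old, k[t])   # delta-rule correction
        outputs.append(S @ q[t])               # read-out with the query
    return torch.stack(outputs)

# Toy usage
T, d = 8, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = delta_rule_attention(q, k, v, torch.sigmoid(torch.randn(T)))  # (T, d)
```

Unlike vanilla linear attention, which only accumulates outer products, the delta rule first reads out what the memory currently associates with k_t and overwrites it toward v_t, which is what gives DeltaNet its improved retrieval behavior.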
Kimi Open-Sources a New Linear Attention Architecture That Surpasses Full-Attention Models for the First Time, with 6x Faster Inference
量子位· 2025-10-31 06:27
Core Insights
- The Transformer era is being redefined by the Kimi Linear architecture, which surpasses traditional attention models under the same training conditions [2][10].

Group 1: Kimi Linear Architecture
- Kimi Linear employs a novel attention mechanism that cuts the KV-cache requirement by 75% and achieves up to 6x faster inference on long-context tasks [4][26].
- The architecture introduces Kimi Delta Attention (KDA), which gives fine-grained control over memory retention, letting the model discard redundant information while preserving what matters [12][10]. (A hedged sketch of such a gated update follows this list.)
- KDA's state update is based on an improved Delta Rule, remaining stable on sequences of millions of tokens without gradient explosion or vanishing [13][14].

Group 2: Performance and Efficiency
- The model uses a 3:1 mixed-layer design, three linear attention layers followed by one full attention layer, balancing global semantic modeling against resource efficiency [15].
- Kimi Linear outperforms traditional Transformers on benchmarks such as MMLU and BBH while maintaining accuracy on mathematical reasoning and code generation tasks [22][26].
- Deployment is seamless with the existing vLLM inference framework, so Transformer-based systems can be upgraded to Kimi Linear with little effort [21].

Group 3: Industry Trends
- The dominance of Transformers is being challenged, with alternatives such as state space models (SSMs) showing potential for efficient computation and long-sequence modeling [28][30].
- Companies like Apple are exploring SSM architectures for their energy efficiency and lower latency, indicating a shift away from sole reliance on Transformers [30].
- The emergence of Kimi Linear signals a broader diversification of AI architectures beyond the conventional Transformer path [32].
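The article does not give KDA's exact equations, but a common way to add fine-grained memory control to the delta rule is a per-channel forget gate on the recurrent state. The sketch below is an illustrative assumption in that spirit, not Kimi's implementation; the names `alpha_t` (per-key-channel decay) and `beta_t` (write strength) are hypothetical.

```python
import torch

def gated_delta_step(S, q_t, k_t, v_t, alpha_t, beta_t):
    """One recurrent step of a gated delta rule with per-channel decay.
    S: (d_v, d_k) memory state; alpha_t: (d_k,) decay gate in (0, 1); beta_t: scalar write strength."""
    S = S * alpha_t                                   # channel-wise forgetting of stale associations
    v_old = S @ k_t                                   # what the decayed memory returns for k_t
    S = S + beta_t * torch.outer(v_t - v_old, k_t)    # delta-rule overwrite toward v_t
    return S, S @ q_t                                 # updated state and read-out for the query

# Toy usage: iterate over a short sequence
d_k, d_v, T = 16, 16, 8
S = torch.zeros(d_v, d_k)
for t in range(T):
    q_t, k_t, v_t = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha_t, beta_t = torch.sigmoid(torch.randn(d_k)), torch.sigmoid(torch.randn(()))
    S, o_t = gated_delta_step(S, q_t, k_t, v_t, alpha_t, beta_t)
```

The difference from a scalar-decay gated linear attention is that each key channel can forget at its own rate, which is one plausible reading of "fine-grained control over memory retention" described above.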
Just In: Kimi Open-Sources a New Architecture and Starts Betting on Linear Attention
机器之心· 2025-10-31 04:11
Core Insights
- The article discusses advances in attention mechanisms, focusing on the Kimi Linear architecture, which combines linear attention and full attention to improve efficiency and performance across tasks [1][2][4].

Group 1: Kimi Linear Architecture
- Kimi Linear introduces a new hybrid linear attention design built around Kimi Delta Attention (KDA), which optimizes the limited state memory of RNN-style models through a more efficient gating mechanism [4][10].
- The architecture interleaves KDA layers with periodic full attention layers at a 3:1 ratio, significantly reducing memory usage while matching or exceeding full-attention quality [10][32]. (A sketch of this layer pattern follows this list.)
- Kimi Linear has 48 billion total parameters with 3 billion activated, and supports context lengths of up to 1 million tokens [5][10].

Group 2: Performance and Efficiency
- Kimi Linear outperforms traditional full attention across a range of tasks, especially long-context ones, cutting the required key-value cache by up to 75% [5][10].
- When processing long contexts, its decoding throughput is six times that of full multi-head attention models [5][59].
- In comparative evaluations, Kimi Linear consistently beats baselines such as MLA and GDN-H on general knowledge, reasoning, and Chinese-language tasks [44][49].

Group 3: Technical Innovations
- KDA adds fine-grained control over memory decay and position awareness, improving the model's expressiveness and efficiency [20][24].
- The implementation uses a block-wise recurrent, intra-block parallel strategy to maximize matrix-multiplication throughput and make effective use of Tensor Cores [26][59].
- The NoPE (No Position Encoding) design enables efficient long-context training by delegating positional information to the KDA layers [34][39].

Group 4: Experimental Results
- Kimi Linear achieved the highest average scores on long-context benchmarks, demonstrating its effectiveness on extended sequences [52][53].
- In reinforcement learning settings, it improved faster and reached better results than MLA, particularly on mathematical reasoning tasks [56][57].
- Efficiency stays high: pre-filling latency overhead versus GDN-H is negligible, and the speed advantage grows as sequence length increases [59][60].
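As an illustration of the 3:1 interleaving described above, here is a minimal PyTorch sketch of such a layer stack. `KDALayer` and `FullAttentionLayer` are simplified placeholders (assumed names, not Kimi's actual modules); only the alternation pattern is the point.

```python
import torch
import torch.nn as nn

class KDALayer(nn.Module):
    """Placeholder for a KDA-style linear attention block (constant-size recurrent state)."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
    def forward(self, x):                      # x: (batch, seq, d_model)
        return x + self.proj(x)

class FullAttentionLayer(nn.Module):
    """Placeholder for a standard full (softmax) attention block."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return x + out

def build_hybrid_stack(num_layers, d_model, ratio=3):
    """Every (ratio+1)-th layer is full attention; the rest are linear attention layers."""
    return nn.Sequential(*[
        FullAttentionLayer(d_model) if (i + 1) % (ratio + 1) == 0 else KDALayer(d_model)
        for i in range(num_layers)
    ])

# Toy usage: 8 layers -> pattern KDA, KDA, KDA, Full, KDA, KDA, KDA, Full
model = build_hybrid_stack(num_layers=8, d_model=64)
y = model(torch.randn(2, 10, 64))              # (2, 10, 64)
```

The periodic full attention layers retain exact global retrieval, while the surrounding linear layers keep KV-cache growth bounded, which is the trade-off the 3:1 design is meant to balance.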
3,700 Pre-training Runs in Search of the "Linear Attention" Non-Consensus: a MiniMax-01 Developer Recounts a 4-Year Exploration
晚点LatePost· 2025-03-09 12:00
"我们跑的是下半场,赌的就是未来的长文本需求。" MiniMax 在今年 1 月发布了参数为 4560 亿的开源大模型 MiniMax-01,该模型就用到了他们开发的线 性注意力机制 "Lightning Attention"。 我们邀请了这个项目的负责人,MiniMax 高级研究总监钟怡然,来与我们一起聊线性注意力的研发过 程。钟怡然在 MiniMax 负责大模型网络架构设计,目前正开发多模态深度推理模型。 钟怡然曾担任上海人工智能实验室青年科学家,是新架构探索组的 PI(项目负责人);他在澳洲国立大 学获得博士学位,师从李宏东教授和 Richard Hartley 院士。他和他的团队已在一些国际顶级学术会议和 期刊上发表了 20 余篇关于模型新架构的论文,覆盖了当前多类非 Transformer 架构,如线性注意力机制 (线性注意力)、长卷积(Long Convolution)和线性循环网络(Linear RNN)。 在 2021 年,线性注意力还是一个 "看起来很美好的泡泡",怡然和团队就开始探索线性架构的实现。 嘉宾 丨 钟怡然 整理 丨 刘倩 程曼祺 上期播客中, 我们与清华的两位博士生,肖朝军和傅 ...