Linear Attention
Kimi open-sources a new linear-attention architecture, surpassing full-attention models for the first time, with inference speed up 6x
量子位· 2025-10-31 06:27
Core Insights
- The era of Transformers is being redefined by the introduction of the Kimi Linear architecture, which surpasses traditional full-attention models under the same training conditions [2][10].

Group 1: Kimi Linear Architecture
- Kimi Linear employs a novel attention mechanism that reduces the KV cache requirement by 75% and achieves up to 6x faster inference on long-context tasks [4][26].
- The architecture introduces Kimi Delta Attention (KDA), which allows fine-grained control over memory retention, enabling the model to discard redundant information while preserving important data [12][10].
- KDA's state update mechanism is based on an improved Delta Rule, remaining stable even on sequences of millions of tokens and preventing gradient explosion or vanishing [13][14].

Group 2: Performance and Efficiency
- The model uses a 3:1 mixed-layer design, three layers of linear attention followed by one layer of full attention, balancing global semantic modeling with resource efficiency [15].
- Kimi Linear outperforms traditional Transformers across multiple benchmarks such as MMLU and BBH, while maintaining accuracy on mathematical reasoning and code generation tasks [22][26].
- Deployment is seamless with existing vLLM inference frameworks, allowing Transformer-based systems to be upgraded to Kimi Linear easily [21].

Group 3: Industry Trends
- The dominance of Transformers is being challenged, with alternatives such as state space models (SSMs) showing potential for efficient computation and long-sequence modeling [28][30].
- Companies like Apple are exploring SSM architectures for their energy efficiency and lower latency, indicating a shift away from sole reliance on traditional Transformers [30].
- The emergence of Kimi Linear signals a move toward more diverse innovation in AI architecture, a departure from the conventional Transformer path [32].
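The delta-rule state update with fine-grained forgetting described above can be sketched in a few lines. This is a minimal single-head illustration of the general gated delta-rule idea, not Kimi's actual implementation; `kda_step`, `beta`, and `a` are illustrative names.

```python
import numpy as np

def kda_step(S, k, v, beta, a):
    """One hypothetical KDA-style state update (names are illustrative).

    S    : (d_k, d_v) recurrent memory matrix
    k, v : (d_k,), (d_v,) key/value for the current token
    beta : scalar write strength in [0, 1] (delta-rule learning rate)
    a    : (d_k,) per-channel decay gate in [0, 1] (fine-grained forgetting)
    """
    S = a[:, None] * S                    # decay old memory channel-wise
    pred = S.T @ k                        # what memory currently returns for k
    S = S + beta * np.outer(k, v - pred)  # delta-rule write: error-correcting,
    return S                              # overwrites only what k reads out

def kda_readout(S, q):
    """Linear-attention-style readout from the recurrent state."""
    return S.T @ q
```

Because the write is error-correcting (it stores `v - pred`, not `v`) and the gates stay in [0, 1], repeated updates neither blow up nor collapse the state, which is the stability property the summary attributes to the improved Delta Rule.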
Just now, Kimi open-sources a new architecture, placing its bet on linear attention
机器之心· 2025-10-31 04:11
Core Insights
- The article discusses advances in attention mechanisms, focusing on the Kimi Linear architecture, which combines linear attention and full attention to improve efficiency and performance across tasks [1][2][4].

Group 1: Kimi Linear Architecture
- Kimi Linear introduces a new hybrid linear-attention architecture built on Kimi Delta Attention (KDA), which optimizes memory usage in finite-state RNNs through a more efficient gating mechanism [4][10].
- The architecture interleaves KDA layers with periodic full-attention layers at a 3:1 ratio, significantly reducing memory usage while matching or exceeding the quality of full attention [10][32].
- Kimi Linear has 48 billion total parameters, with 3 billion activated, and can handle context lengths of up to 1 million tokens [5][10].

Group 2: Performance and Efficiency
- Kimi Linear outperforms traditional full-attention methods across various tasks, especially long-context ones, reducing the need for large key-value caches by up to 75% [5][10].
- When processing long contexts, the model achieves decoding throughput six times that of complete multi-head attention models [5][59].
- In comparative evaluations, Kimi Linear consistently outperforms baselines such as MLA and GDN-H on general knowledge, reasoning, and Chinese-language tasks [44][49].

Group 3: Technical Innovations
- The KDA mechanism introduces fine-grained control over memory decay and position awareness, enhancing the model's expressiveness and efficiency [20][24].
- The architecture employs a block-wise recursive, intra-block parallel strategy to maximize matrix-multiplication throughput, making effective use of Tensor Cores [26][59].
- The NoPE (No Position Encoding) design delegates responsibility for position information to the KDA layers, enabling efficient long-context training [34][39].
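The 3:1 interleaving and the 75% cache figure fit together arithmetically: if only the full-attention layers keep a per-token KV cache, a 3:1 pattern leaves one cached layer in four. A minimal sketch, with layer names and cache dimensions that are illustrative rather than Kimi Linear's real configuration:

```python
def build_pattern(n_layers, ratio=3):
    """Every (ratio+1)-th layer is full attention; the rest are KDA (linear)."""
    return ["full" if (i + 1) % (ratio + 1) == 0 else "kda"
            for i in range(n_layers)]

def kv_cache_bytes(pattern, seq_len, n_heads, head_dim, bytes_per=2):
    """Only full-attention layers keep a per-token KV cache.

    KDA layers carry a fixed-size recurrent state instead, so their
    memory cost does not grow with seq_len."""
    full_layers = sum(1 for p in pattern if p == "full")
    return full_layers * seq_len * n_heads * head_dim * 2 * bytes_per  # K and V

pattern = build_pattern(48)
dense = kv_cache_bytes(["full"] * 48, 10**6, 32, 128)
hybrid = kv_cache_bytes(pattern, 10**6, 32, 128)
print(f"cache reduced to {hybrid / dense:.0%} of full attention")  # prints "cache reduced to 25% of full attention"
```

The ratio alone fixes the saving: the hybrid cache is 25% of the dense one regardless of sequence length, heads, or precision, which matches the "up to 75%" reduction cited above.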
Group 4: Experimental Results
- Kimi Linear achieved the highest average scores on long-context benchmarks, demonstrating its effectiveness on extended sequences [52][53].
- In reinforcement learning scenarios, Kimi Linear improved faster and further than MLA, particularly on mathematical reasoning tasks [56][57].
- Efficiency remains high: latency overhead versus GDN-H during pre-filling is negligible, while the speed advantage grows significantly with sequence length [59][60].
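The block-wise recursive, intra-block parallel strategy mentioned under Technical Innovations can be illustrated with plain (ungated) causal linear attention. The sketch below is an assumption about how such kernels are commonly structured, not KDA's actual kernel, which additionally handles the decay gates: within a chunk everything is dense matmuls (parallel, Tensor-Core friendly), and a small running state carries the recurrence between chunks.

```python
import numpy as np

def chunked_linear_attention(Q, K, V, chunk=64):
    """Chunk-wise causal linear attention: out_t = sum_{j<=t} (q_t . k_j) v_j.

    Intra-chunk work is a masked dense matmul; inter-chunk history is
    folded into a fixed-size state S, so cost is linear in sequence length.
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))           # inter-chunk recurrent state
    out = np.empty((T, d_v))
    for s in range(0, T, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        scores = np.tril(q @ k.T)      # causal mask within the chunk
        out[s:s+chunk] = q @ S + scores @ v   # history term + local term
        S = S + k.T @ v                # fold this chunk into the state
    return out
```

The result is identical to the token-by-token recurrence; chunking only reorganizes the computation so that most of it lands in large matrix multiplications, which is the throughput argument made above.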
3,700 pre-training runs in search of the "linear attention" non-consensus: a MiniMax-01 developer recounts 4 years of exploration
晚点LatePost· 2025-03-09 12:00
"We are running the second half of the race; our bet is on future demand for long text."

In January this year, MiniMax released MiniMax-01, an open-source large model with 456 billion parameters that uses "Lightning Attention", a linear attention mechanism of their own development.

We invited the project's lead, 钟怡然 (Zhong Yiran), Senior Research Director at MiniMax, to talk with us about the development of linear attention. At MiniMax, Zhong is responsible for large-model network architecture design and is currently developing a multimodal deep reasoning model.

Zhong previously served as a young scientist at the Shanghai AI Laboratory, where he was the PI of the new-architecture exploration group. He received his PhD from the Australian National University, advised by Professor 李宏东 (Hongdong Li) and Fellow Richard Hartley. He and his team have published more than 20 papers on new model architectures at top international conferences and journals, covering several of today's non-Transformer families, such as linear attention, long convolution (Long Convolution), and linear recurrent networks (Linear RNN).

Back in 2021, when linear attention still looked like a "beautiful bubble", Zhong and his team began exploring how to make linear architectures work.

Guest | 钟怡然 (Zhong Yiran)
Compiled by | 刘倩, 程曼祺

In the previous podcast episode, we spoke with two Tsinghua PhD students, 肖朝军 and 傅 ...