Kimi Delta Attention (KDA)
AI Industry Tracking: 月之暗面 (Moonshot AI) Releases a New Attention Architecture, Kimi Linear; Continued Focus on Agent LLM Technology Iteration
Changjiang Securities · 2025-11-06 11:05
Investment Rating
- The report maintains a "Positive" investment rating for the industry [8].

Core Insights
- On October 31, 月之暗面 (Moonshot AI) launched Kimi Linear, a new hybrid linear attention architecture aimed at the computational-efficiency and performance bottlenecks that current LLMs face on long-sequence tasks. The core code has been open-sourced and validated [2][5].
- Kimi Delta Attention (KDA) strengthens expressive capability through a refined gating mechanism and a highly optimized chunk-wise (block) processing algorithm, potentially opening a new paradigm for reducing the cost of token consumption [2][10].
- The report remains optimistic on the domestic AI industry chain, recommending picks-and-shovels plays and leading players with clear positioning advantages [2][10].

Summary by Sections

Event Description
- The launch of Kimi Linear targets the core bottlenecks of traditional Transformers in long-text processing and agent-based reasoning; its 3:1 hybrid layer structure cuts the KV cache by 75% and improves long-sequence decoding efficiency [10].

Performance Comparison
- Kimi Linear outperforms full attention on a range of metrics, achieving the highest accuracy across tasks as sequence length increases, and converges significantly faster than GDN [10].
- On long-context performance, Kimi Linear scores 54.5, surpassing MLA (52.2) and GDN-H (51.2), demonstrating its robustness on long texts [10].

Efficiency Comparison
- Kimi Linear shows a dramatic advantage in decoding speed, requiring only 1.84 ms per token at 1M context length, 6.3 times faster than MLA [10].
- The KV-cache memory footprint of Kimi Linear is approximately 25% of that of a pure MLA model, pointing to lower inference costs and a better user experience [10] (a rough arithmetic sketch follows at the end of this summary).

Future Outlook
- The report argues that KDA shows significant potential for linear attention across applications, particularly long-text reasoning and enterprise-level knowledge systems, with a focus on reducing inference cost and latency for large-scale deployment [10].
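The 75% KV-cache reduction and the ~25% memory figure follow from the 3:1 layer ratio: only every fourth layer is full attention and must store a KV cache that grows with sequence length, while the KDA layers keep a fixed-size recurrent state. The sketch below works through that arithmetic; the layer count, head count, head dimension, and fp16 storage are illustrative assumptions, not figures from the report.

```python
def full_attn_kv_cache_bytes(num_full_attn_layers, seq_len, num_kv_heads=8,
                             head_dim=128, bytes_per_elem=2):
    """KV cache held by the full-attention layers only: one K and one V tensor of
    shape (seq_len, num_kv_heads, head_dim) per layer (sizes are illustrative)."""
    return num_full_attn_layers * 2 * seq_len * num_kv_heads * head_dim * bytes_per_elem

num_layers, seq_len = 48, 1_000_000               # hypothetical depth, 1M-token context
pure_full_attn = full_attn_kv_cache_bytes(num_layers, seq_len)
hybrid_3_to_1  = full_attn_kv_cache_bytes(num_layers // 4, seq_len)  # 1 full-attn layer per 4

print(f"hybrid / pure KV cache: {hybrid_3_to_1 / pure_full_attn:.2%}")  # -> 25.00%
```

The KDA layers add only a small, sequence-length-independent recurrent state on top of this, which is presumably why the report quotes "approximately 25%" rather than exactly 25%.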
Just Now: Kimi Open-Sources a New Architecture, Betting on Linear Attention
机器之心 · 2025-10-31 04:11
Core Insights
- The article discusses advances in attention mechanisms, focusing on the Kimi Linear architecture, which combines linear attention and full attention to improve efficiency and performance across a range of tasks [1][2][4].

Group 1: Kimi Linear Architecture
- Kimi Linear is a new hybrid linear attention architecture whose linear component, Kimi Delta Attention (KDA), uses a more efficient gating mechanism to make better use of the limited state of an RNN-style memory [4][10].
- The architecture interleaves KDA layers with periodic full-attention layers at a 3:1 ratio, significantly reducing memory usage while matching or exceeding the quality of full attention [10][32].
- Kimi Linear has 48 billion total parameters with 3 billion activated, and supports context lengths of up to 1 million tokens [5][10].

Group 2: Performance and Efficiency
- Kimi Linear delivers superior performance across tasks, outperforming traditional full-attention methods, especially on long-context tasks, while cutting the key-value cache requirement by up to 75% [5][10].
- When processing long contexts, the model achieves decoding throughput up to six times that of a full multi-head attention model [5][59].
- In comparative evaluations, Kimi Linear consistently outperforms baselines such as MLA and GDN-H on general knowledge, reasoning, and Chinese-language tasks [44][49].

Group 3: Technical Innovations
- The KDA mechanism introduces fine-grained control over memory decay and position awareness, enhancing the model's expressiveness and efficiency [20][24] (a toy sketch follows at the end of this article).
- The architecture employs chunk-wise recurrence with intra-chunk parallelism to maximize matrix-multiplication throughput and exploit Tensor Cores effectively [26][59].
- The NoPE (no position encoding) design in Kimi Linear enables efficient long-context training by delegating responsibility for positional information to the KDA layers [34][39].

Group 4: Experimental Results
- Kimi Linear achieved the highest average scores on long-context benchmarks, demonstrating its effectiveness on very long sequences [52][53].
- In reinforcement-learning settings, Kimi Linear improved faster and reached better performance than MLA, particularly on mathematical reasoning tasks [56][57].
- Efficiency remains high: latency overhead versus GDN-H during pre-filling is negligible, while the speed advantage grows substantially as sequence length increases [59][60].
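To make the "fine-grained gating" idea in Group 3 concrete, the toy sketch below implements a gated delta-rule recurrence of the general family KDA builds on: a per-channel decay gate shrinks the recurrent state, a delta-rule correction writes in the part of the value the state failed to predict, and the output is a read-out with the query. This is a minimal per-token reference loop for intuition only, not Moonshot AI's released kernel or its exact parameterization; the chunk-wise, Tensor-Core-friendly formulation mentioned above is what makes the real thing fast. All shapes and gate ranges here are illustrative assumptions.

```python
import numpy as np

def gated_delta_attention(q, k, v, decay, beta):
    """Toy per-token recurrence in the gated delta-rule family (illustrative only).

    q, k:   (T, d_k)  queries / keys
    v:      (T, d_v)  values
    decay:  (T, d_k)  per-channel forget gates in (0, 1) -- the "fine-grained" part
    beta:   (T,)      write strengths in (0, 1)
    Returns o: (T, d_v) outputs.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))                           # fixed-size state instead of a growing KV cache
    o = np.zeros((T, d_v))
    for t in range(T):
        S = decay[t][:, None] * S                      # channel-wise decay of the memory
        v_pred = S.T @ k[t]                            # what the current state predicts for this key
        S += beta[t] * np.outer(k[t], v[t] - v_pred)   # delta-rule correction toward the true value
        o[t] = S.T @ q[t]                              # read out with the query
    return o

# Smoke test with random inputs.
rng = np.random.default_rng(0)
T, d_k, d_v = 16, 8, 8
o = gated_delta_attention(rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)),
                          rng.normal(size=(T, d_v)),
                          rng.uniform(0.8, 1.0, size=(T, d_k)),
                          rng.uniform(0.1, 0.9, size=T))
print(o.shape)  # (16, 8)
```

Because the state S has a fixed size regardless of sequence length, memory stays constant during decoding; the per-channel decay vector (rather than a single scalar gate per head) is what the articles describe as the finer-grained control that improves expressiveness over coarser gated variants.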