AI Industry Tracking: Moonshot AI (月之暗面) Releases New Attention Architecture Kimi Linear; Continued Focus on Agent LLM Technology Iteration
Changjiang Securities·2025-11-06 11:05

Investment Rating
- The report maintains a "Positive" investment rating for the industry [8].

Core Insights
- On October 31, Moonshot AI (月之暗面) launched Kimi Linear, a new hybrid linear attention architecture aimed at the computational-efficiency and performance bottlenecks current LLMs face on long-sequence tasks; the core code has been open-sourced and validated [2][5].
- Kimi Delta Attention (KDA) improves expressiveness through a finer-grained gating mechanism and a highly optimized block (chunkwise) processing algorithm, potentially opening a new paradigm for reducing per-token cost (a hedged sketch of this style of recurrence appears at the end of this note) [2][10].
- The report remains optimistic on the domestic AI industry chain, recommending picks-and-shovels names and leading players with clear positioning advantages [2][10].

Summary by Sections

Event Description
- Kimi Linear targets the core bottlenecks of traditional Transformers in long-text processing and agent-based reasoning: its 3:1 hybrid layer structure (three KDA layers for every full-attention layer) cuts KV cache usage by 75% and improves long-sequence decoding efficiency (see the KV-cache arithmetic sketch at the end of this note) [10].

Performance Comparison
- Kimi Linear outperforms full attention across a range of metrics, achieving the highest accuracy on tasks as sequence length grows, and converges markedly faster than GDN [10].
- On long-context evaluation, Kimi Linear scores 54.5, surpassing MLA (52.2) and GDN-H (51.2), demonstrating its robustness on long texts [10].

Efficiency Comparison
- Kimi Linear shows a decisive advantage in decoding speed: at 1M sequence length it needs only 1.84 ms per output token, about 6.3 times faster than MLA [10].
- Kimi Linear's KV cache occupies roughly 25% of the memory of a pure MLA model, pointing to lower inference cost and a better user experience [10].

Future Outlook
- The report argues that KDA demonstrates the broad potential of linear attention across applications, particularly long-text reasoning and enterprise-level knowledge systems, where reducing inference cost and latency is key to large-scale deployment [10].
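
To make the KDA description above more concrete, below is a minimal sketch of a per-channel gated delta rule, the family of linear-attention recurrences that KDA belongs to. The shapes, symbol names, and the exact update are illustrative assumptions chosen for readability, not Moonshot AI's released implementation or kernel.

```python
import torch

def gated_delta_step(S, k, v, a, beta):
    """One recurrent step on the fast-weight state S (d_k x d_v).

    S    : (d_k, d_v) state carried across tokens
    k, v : (d_k,), (d_v,) current key / value
    a    : (d_k,) per-channel decay gate in (0, 1), finer-grained than a scalar gate
    beta : scalar write strength in (0, 1)
    """
    S = a.unsqueeze(-1) * S                   # channel-wise forgetting
    pred = k @ S                              # what the state currently predicts for key k
    S = S + beta * torch.outer(k, v - pred)   # delta-rule correction toward the true value
    return S

def recurrent_decode(keys, values, gates, betas, queries):
    """Recurrent decoding: per-token cost and state size stay constant (no growing KV cache)."""
    d_k, d_v = keys.shape[-1], values.shape[-1]
    S = torch.zeros(d_k, d_v)
    outputs = []
    for k, v, a, b, q in zip(keys, values, gates, betas, queries):
        S = gated_delta_step(S, k, v, a, b)
        outputs.append(q @ S)                 # read-out for the current query
    return torch.stack(outputs)
```

Because the state S has a fixed size, the linear-attention layers add nothing to memory as the sequence grows, which is the property the hybrid design exploits.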
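
As a rough illustration of the 3:1 hybrid stacking and why it maps to the reported ~75% KV-cache reduction (only the full-attention layers keep a growing cache), here is a back-of-the-envelope sketch; the layer count, head count, and head dimension are made-up assumptions for illustration only.

```python
def build_layer_plan(n_layers: int, ratio: int = 3):
    """Interleave `ratio` linear-attention (KDA-style) layers per full-attention layer."""
    return ["full_attn" if (i + 1) % (ratio + 1) == 0 else "linear_attn"
            for i in range(n_layers)]

def kv_cache_bytes(plan, seq_len, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Only the full-attention layers accumulate a per-token K/V cache."""
    per_layer = seq_len * n_kv_heads * head_dim * 2 * bytes_per_elem   # K and V
    return sum(per_layer for layer in plan if layer == "full_attn")

plan = build_layer_plan(n_layers=48, ratio=3)              # hypothetical 48-layer stack
hybrid = kv_cache_bytes(plan, seq_len=1_000_000)
full = kv_cache_bytes(["full_attn"] * 48, seq_len=1_000_000)
print(f"hybrid / full-attention KV cache: {hybrid / full:.2f}")   # ~0.25, i.e. the ~75% reduction
```

With three linear-attention layers per full-attention layer, only 12 of the hypothetical 48 layers cache K/V, which is where the roughly 25%-of-MLA figure cited above comes from.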