Kimi Linear
Revisiting Attention: DeltaNet and New Improvements to Linear Attention, Now Used by Both Alibaba and Kimi | LatePost Podcast
晚点LatePost· 2025-12-02 09:13
Core Insights
- The article discusses advances in linear attention mechanisms, particularly DeltaNet, which aim to make large language models (LLMs) more efficient by reducing the computational cost of traditional attention [5][10][12].

Group 1: Linear Attention Mechanisms
- Linear attention mechanisms such as DeltaNet were introduced to address the computational bottleneck of standard attention, whose cost grows quadratically with input length [5][12].
- DeltaNet has been developed collaboratively since its inception in 2021, with researchers focusing on improving the update rule and the parallelization of linear attention (a toy sketch of the delta-rule update follows this summary) [7][20][21].
- The recent open-source releases of Qwen3-Next by Alibaba and Kimi Linear by Kimi both incorporate linear attention, signaling a shift toward these more efficient designs in flagship models [5][24].

Group 2: DeltaNet and Its Evolution
- DeltaNet was initially overlooked because it lacked key architectural refinements and efficient implementations, but recent advances have driven its adoption in industry [20][24].
- The Gated DeltaNet variant improves memory control and retrieval performance and maps better onto modern hardware [7][21][24].
- The relationship between DeltaNet and models such as Kimi Linear reflects the trend of combining linear attention with conventional full attention to balance speed and capacity [24][25].

Group 3: Future Directions and Challenges
- The article emphasizes the need for further exploration of update rules in linear attention, suggesting that improvements here could yield better performance and scalability [48][49].
- Combining sparse attention with linear attention is discussed as a way to tackle long-text processing, which remains a significant hurdle for current models [46][49].
- The ongoing industry debate over linear versus full attention reflects the trade-offs involved in model design for different applications [27][30].
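For readers unfamiliar with the delta rule the episode keeps returning to, the following is a minimal, sequential sketch of how a DeltaNet-style update differs from vanilla linear attention: instead of blindly accumulating k v^T into a fixed-size memory, it writes only the prediction error, scaled by a learned strength beta_t. The shapes, the NumPy loop, and the beta parameterization here are illustrative assumptions; production implementations use chunked, parallel kernels rather than this per-token loop.

```python
import numpy as np

def delta_rule_scan(q, k, v, beta):
    """Sequential toy form of a DeltaNet-style update (illustrative only; real
    implementations use chunked parallel kernels). q, k: (T, d_k); v: (T, d_v);
    beta: (T,) learned write strengths in [0, 1]."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))              # fixed-size associative memory
    out = np.empty_like(v)
    for t in range(len(k)):
        v_pred = S.T @ k[t]               # what the memory currently recalls for key k_t
        # Delta rule: correct the prediction error rather than adding k v^T outright.
        S = S + beta[t] * np.outer(k[t], v[t] - v_pred)
        out[t] = S.T @ q[t]               # per-step cost is O(d_k * d_v), independent of T
    return out

# Toy usage
rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 4
o = delta_rule_scan(rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)),
                    rng.normal(size=(T, d_v)), rng.uniform(0, 1, size=T))
print(o.shape)  # (8, 4)
```

The key property is that the state S stays the same size no matter how long the sequence grows, which is what removes the quadratic cost and the growing KV cache of standard attention.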
Which Attention is All You Need?
机器之心· 2025-11-09 01:30
Core Insights
- The article surveys ongoing innovation in the attention mechanism for AI and robotics, arguing that algorithmic breakthroughs are needed to tame its computational cost and improve performance [5][7].

Group 1: Attention Mechanism Innovations
- The industry is focused on optimizing attention because standard self-attention costs O(N^2) in sequence length, a fundamental obstacle to efficient long-sequence modeling [9].
- Two main improvement paths have emerged: linear attention, which reduces the complexity to O(N), and sparse attention, which restricts computation to a subset of important tokens [10][13].
- Kimi Linear, a recent development, shows significant gains over traditional full attention, cutting KV cache requirements by up to 75% and decoding 1-million-token contexts up to six times faster [11][12].

Group 2: Linear Attention Approaches
- Linear attention work falls into three main strands: kernelized methods, forgetting (gating) mechanisms, and in-context-learning-style update rules, each aiming to preserve performance while linearizing the attention computation [10][11].
- The Kimi Linear architecture incorporates a channel-wise gating mechanism that optimizes how the RNN-like state uses its memory, and demonstrates strong performance across scenarios [12].
- Kimi Linear uses a hierarchical mixed architecture that interleaves linear and full attention layers, improving both efficiency and effectiveness [12].

Group 3: Sparse Attention Strategies
- Sparse attention pre-selects a subset of important tokens for the attention computation, using fixed patterns, block-sparse layouts, or clustering approaches [13][14].
- DeepSeek's NSA and DSA represent significant advances in sparse attention; DSA uses a token-wise sparse strategy that dramatically reduces attention cost while preserving quality (a toy top-k selection sketch follows this summary) [16][17].
- In tests, DSA reduces attention complexity from O(L^2) to O(Lk), cutting costs by 60%-70% in both the prefill and decoding phases [17].
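To make the O(L^2) to O(Lk) idea concrete, here is a toy token-wise sparse attention step: each query attends only to its top_k highest-scoring earlier tokens. For brevity the toy still scores every key, whereas DSA-style designs use a lightweight indexer so that the selection itself stays cheap; everything in this sketch (shapes, scoring, selection) is an illustrative assumption, not DeepSeek's kernel.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Toy token-wise sparse attention: each query keeps only its top_k
    highest-scoring causal keys, so the softmax and weighted sum touch k
    tokens per query (O(Lk)) instead of all L. Selection here reuses full
    scores purely for clarity; real systems use a cheap separate indexer."""
    L, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    scores[~np.tril(np.ones((L, L), dtype=bool))] = -np.inf   # causal mask
    out = np.zeros_like(v)
    for t in range(L):
        kk = min(top_k, t + 1)
        idx = np.argpartition(scores[t], -kk)[-kk:]            # selected token subset
        w = np.exp(scores[t, idx] - scores[t, idx].max())
        out[t] = (w / w.sum()) @ v[idx]
    return out

# Toy usage
rng = np.random.default_rng(1)
L, d = 16, 8
print(topk_sparse_attention(rng.normal(size=(L, d)), rng.normal(size=(L, d)),
                            rng.normal(size=(L, d)), top_k=4).shape)  # (16, 8)
```

The cost saving comes from the weighted sum over k selected values per query; whether the overall method is sub-quadratic then hinges on how cheaply the selection step can be done.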
AI Industry Tracking: Moonshot AI (月之暗面) Releases the New Attention Architecture Kimi Linear; Continued Focus on Agent LLM Technology Iteration
Changjiang Securities· 2025-11-06 11:05
Investment Rating
- The report maintains a "Positive" investment rating for the industry [8].

Core Insights
- On October 31, 月之暗面 (Moonshot AI) launched Kimi Linear, a new hybrid linear attention architecture aimed at the computational-efficiency and performance bottlenecks current LLMs face on long-sequence tasks; the core code has been open-sourced and validated [2][5].
- Kimi Delta Attention (KDA) improves expressiveness through a refined gating mechanism and a highly optimized block-processing algorithm, potentially opening a new paradigm for driving down the cost of token consumption [2][10].
- The report remains optimistic about the domestic AI industry chain, recommending picks-and-shovels stocks and major players with strong positioning advantages [2][10].

Summary by Sections
Event Description
- Kimi Linear targets the core bottlenecks of traditional Transformers in long-text processing and agentic reasoning; its 3:1 mixed layer structure reduces the KV cache by 75% and improves long-sequence decoding efficiency (a back-of-envelope version of this arithmetic follows this summary) [10].

Performance Comparison
- Kimi Linear outperforms full attention on multiple metrics, achieving the highest accuracy as sequence length grows, and converges significantly faster than GDN [10].
- On long-context performance it scores 54.5, surpassing MLA (52.2) and GDN-H (51.2), demonstrating robustness on long texts [10].

Efficiency Comparison
- Kimi Linear shows a dramatic advantage in decoding speed, needing only 1.84 ms per token at 1M context length, 6.3 times faster than MLA [10].
- Its KV cache uses roughly 25% of the memory of a pure MLA model, pointing to lower inference costs and a better user experience [10].

Future Outlook
- The report argues that KDA demonstrates the potential of linear attention across applications, particularly long-text reasoning and enterprise knowledge systems, with a focus on reducing inference cost and latency for large-scale deployment [10].
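The reported 75% KV-cache reduction follows almost directly from the 3:1 layer ratio: only one in every four layers is a full-attention layer that keeps a per-token KV cache, while KDA layers carry a fixed-size state. The back-of-envelope script below makes that arithmetic explicit; the layer count, head sizes, and dtype are illustrative assumptions, not Kimi Linear's actual configuration, and the full-attention layers are modeled as plain K/V caching rather than MLA's compressed latent cache.

```python
# Back-of-envelope KV-cache comparison for a 3:1 hybrid stack.
# Layer count, heads, head_dim, and dtype are illustrative assumptions.
layers, heads, head_dim, bytes_per_elem = 32, 32, 128, 2   # bf16
seq_len = 1_000_000

def kv_cache_bytes(n_caching_layers: int) -> int:
    # K and V tensors per token, per layer that actually keeps a growing cache
    return n_caching_layers * seq_len * heads * head_dim * 2 * bytes_per_elem

full_attention = kv_cache_bytes(layers)        # every layer caches K/V
hybrid_3_to_1  = kv_cache_bytes(layers // 4)   # only 1 of every 4 layers does

print(f"full attention: {full_attention / 2**30:.0f} GiB")
print(f"3:1 hybrid    : {hybrid_3_to_1 / 2**30:.0f} GiB "
      f"({1 - hybrid_3_to_1 / full_attention:.0%} smaller)")   # -> 75% smaller
```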
Kimi Linear First Author Zhang Yu: Some Reflections on Model Training
自动驾驶之心· 2025-11-06 00:04
Core Insights
- The post shares first-hand reflections from a Kimi Linear first author on the model's architecture and training process [4][5][10].

Model Architecture
- Kimi Linear is a hybrid model that interleaves linear attention (KDA) and full attention (MLA) layers at a 3:1 ratio, which was found to be the best balance of efficiency and performance [5].
- The architecture builds on the design of Moonlight, with the MoE sparsity increased significantly, from 8 to 32 [4].

Training Process
- The model was trained on 5.7 trillion tokens, a significant scale-up over previous models, with a focus on overcoming challenges in distributed training [10][12].
- Training involved close monitoring and adjustments, including switching key parameters from bf16 to fp32 to keep optimization stable (a minimal sketch of this pattern follows this summary) [12][13].

Performance and Benchmarking
- Despite its smaller size, Kimi Linear showed substantial improvements in benchmark comparisons, often outperforming larger models on specific tasks [7][14].
- Decoding efficiency improved by roughly 6x, thanks to the reduced KV cache usage enabled by KDA [8].

Future Directions
- Kimi aims to establish Kimi Linear as a flagship-class model, with ongoing efforts to refine its architecture and performance metrics [17][19].
- Hybrid models and efficient attention mechanisms are highlighted as key directions for future research and development in the industry [19].
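The bf16-to-fp32 switch mentioned above is a common mixed-precision pattern: the bulk of the weights and matmuls stay in bf16, while a small set of numerically sensitive parameters is kept in fp32. The post does not say exactly which parameters Kimi promoted, so the gate parameter below is a purely illustrative choice; this is a minimal sketch of the pattern, not Kimi Linear's module layout.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Toy block: heavy matmuls in bf16, a small but numerically sensitive
    gate parameter kept in fp32. Illustrative pattern only."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, dtype=torch.bfloat16)
        # fp32 on purpose, so tiny gate updates are not lost to bf16 rounding
        self.gate_logit = nn.Parameter(torch.zeros(dim, dtype=torch.float32))

    def forward(self, x):                      # x: bf16 activations
        h = self.proj(x)                       # bf16 matmul
        gate = torch.sigmoid(self.gate_logit)  # computed in fp32 for stability
        return (h.float() * gate).to(torch.bfloat16)

block = GatedBlock(64)
y = block(torch.randn(2, 8, 64, dtype=torch.bfloat16))
print(y.dtype, block.gate_logit.dtype)        # torch.bfloat16 torch.float32
```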
Kimi Open-Sources a New Linear Attention Architecture; AI ETF (515070) Holding 360 (三六零) Rises Over 7% Intraday
Mei Ri Jing Ji Xin Wen· 2025-11-03 02:54
Group 1
- The A-share market declined, with the ChiNext index down 1%; sectors such as Hainan, gaming, and solar thermal power gained, while precious metals and batteries fell [1].
- The AI ETF (515070) fell 1.53%, with notable moves among its holdings: 37 Interactive Entertainment hit the daily limit up, 360 Technology rose 7.1%, and Stone Technology dropped 5.2% [1].
- The Kimi Linear architecture, which surpasses the Transformer architecture in a range of scenarios, introduces the "Kimi Delta Attention" mechanism, cutting KV cache usage by 75% and raising decoding throughput six-fold [1].

Group 2
- CITIC Securities analysis points to a shift in AI large-model development from sheer parameter scale toward higher "capability density" and better architectural efficiency, driven by algorithmic innovations inspired by brain science [2].
- This transition is expected to lower the computational threshold, letting small and medium enterprises access AI at reduced cost and creating broader industrial applications and investment opportunities [2].
- The AI ETF (515070) tracks the CS AI Theme Index (930713), which focuses on companies providing technology and resources for AI, with its top-weighted stocks including major domestic tech leaders [2].
Tencent Research Institute AI Express 20251103
腾讯研究院· 2025-11-02 16:06
Group 1: AI Security Solutions
- OpenAI has launched Aardvark, a "white hat" agent powered by GPT-5 that automatically identifies and fixes security vulnerabilities in codebases; it recognized 92% of known and artificially injected vulnerabilities [1].
- Aardvark's workflow covers threat modeling, commit scanning, sandbox validation, and Codex-based repair, using LLM reasoning to operate like a human security researcher [1].
- Major tech companies including Google, Anthropic, and Microsoft also released similar white-hat agents in October, responding to the growing number of vulnerabilities and increasingly sophisticated attacks in the AI era [1].

Group 2: AI Programming Models
- The newly released models from the AI coding tools Cursor (Composer-1) and Windsurf (SWE-1.5) are suspected of being built on Chinese models, with Cursor showing a tendency to respond in Chinese [2].
- Users found that Cursor Composer-1 uses the same tokenizer as DeepSeek, while Windsurf's claim of in-house development was contradicted by its ties to the GLM model developed by Zhipu [2].
- Chinese open-source models dominate performance rankings, filling the top 5 and even the top 10, making them a rational, cost-effective choice for startups [2].

Group 3: Attention Mechanisms in AI Models
- Linear attention mechanisms are making a comeback, with domestic models such as MiniMax-M1, Qwen3-Next, and DeepSeek V3.2 adopting linear or sub-quadratic attention variants [3].
- MiniMax's new M2 model reverted to traditional attention, citing accuracy problems with linear attention on reasoning and multi-turn dialogue tasks [3].
- Kimi Linear proposes a hybrid attention strategy, pairing three linear attention blocks with one full attention block, achieving a 75% reduction in KV cache and up to a 6x increase in decoding throughput [3].

Group 4: Canva's AI Innovations
- Canva, valued at $42 billion, has introduced a self-trained foundation model capable of producing complete design files with editable layers, and has made the acquired Affinity tool permanently free [4].
- The core feature, Ask @Canva, is deeply integrated into the design interface, letting users modify elements with natural language while the AI also suggests design improvements [4].
- Canva's annual revenue is roughly $3 billion with over 240 million monthly active users; it is expected to go public in 2026, competing directly with Adobe for a 70% market share [4].

Group 5: Neuralink's Ambitions
- Elon Musk announced that the first Neuralink recipient, Noland Arbaugh, may be the first to receive an upgrade or dual chip implant, predicting that Neuralink users could eventually outperform others in gaming [5].
- Neuralink has had 12 users with more than 2,000 cumulative days of use and over 15,000 hours of total active time; results from the first three trial participants have been submitted to the New England Journal of Medicine [5].
- The company has started a new "thought-to-text" clinical trial, aiming to implant 20,000 people annually by 2031, targeting annual revenue above $1 billion and applications for healthy individuals starting in 2030 [5].

Group 6: AI in Speech Therapy
- A Stanford University research team tested 15 mainstream models on speech-disorder recognition; the best performer reached only 55% accuracy, below the FDA's clinical standard of 80-85% [6].
- The study revealed biases: models performed better on male voices than female ones, on English speakers than speakers of other languages, and on older children than younger ones [6].
- Fine-tuning shows promise: accuracy improved by 10% after fine-tuning on a small dataset of children's speech, indicating the potential of multimodal language models in speech pathology [6].

Group 7: AI Workflow Transformation
- Brex, valued at $12.3 billion, is turning its internal AI platform into a product, built on Retool and reusing external AI capabilities, maintained by a 25-person systems engineering team [7].
- The COO is restructuring operations: L1 tasks go to AI, L2 roles shift from managing people to managing agents, and L3 responsibilities evolve from problem-solving to system design, with a predicted 5-10x gain in operational efficiency [7].
- Hiring is shifting from specialists to generalists, with interviews probing AI usage habits, requiring AI case studies, and assessing AI application skills on real business challenges [7].

Group 8: OpenAI's Restructuring
- OpenAI has completed its restructuring: a non-profit foundation holds shares valued at $130 billion, making it one of the largest charitable foundations globally, with an initial $25 billion commitment to healthcare and AI safety [8].
- A new agreement stipulates that OpenAI's current and future AGI model APIs will be deployed exclusively on Azure for seven years, with Microsoft holding roughly 32.5% of OpenAI, valued at about $135 billion [8].
- The two parties signed a $250 billion Azure pre-purchase contract; Microsoft's capital expenditure reached $34.9 billion last quarter, up 40% from the previous quarter, mostly directed toward new data centers and AI chip procurement [8].

Group 9: Legal Issues Surrounding OpenAI
- Ilya Sutskever testified for nearly 10 hours in the lawsuit Elon Musk filed against OpenAI [9].
- Sutskever submitted a 52-page memorandum detailing allegations against Altman, including deceiving the board, sowing discord, creating chaos, and enabling Anthropic's growth [9].
- After Altman's dismissal, the board seriously considered merging with Anthropic and appointing Dario Amodei as CEO, but the plan fell through due to operational challenges and a revolt by 700 employees [10].
Just Now: Kimi Open-Sources a New Architecture, Betting on Linear Attention
机器之心· 2025-10-31 04:11
Core Insights
- The article covers advances in attention mechanisms, focusing on the Kimi Linear architecture, which combines linear attention and full attention to improve efficiency and performance across tasks [1][2][4].

Group 1: Kimi Linear Architecture
- Kimi Linear introduces a new hybrid linear attention design built around Kimi Delta Attention (KDA), which uses a more efficient gating mechanism to optimize how a finite-state RNN memory is used [4][10].
- The architecture interleaves KDA layers with periodic full attention layers at a 3:1 ratio, significantly reducing memory usage while matching or exceeding full-attention quality [10][32].
- Kimi Linear has 48 billion total parameters with 3 billion activated, and supports context lengths of up to 1 million tokens [5][10].

Group 2: Performance and Efficiency
- Kimi Linear outperforms traditional full attention across tasks, especially long-context ones, cutting the required key-value cache by up to 75% [5][10].
- When processing long contexts, the model decodes up to six times faster than a full multi-head attention baseline [5][59].
- In comparative evaluations, Kimi Linear consistently beats baselines such as MLA and GDN-H on general knowledge, reasoning, and Chinese-language tasks [44][49].

Group 3: Technical Innovations
- KDA adds fine-grained, channel-wise control over memory decay and position awareness, improving the model's expressiveness and efficiency (a toy sketch of a channel-wise gated delta update follows this summary) [20][24].
- The implementation uses a block-wise recurrent and intra-block parallel strategy to maximize matrix-multiplication throughput and make effective use of Tensor Cores [26][59].
- The NoPE (no position encoding) design leaves the full-attention layers without explicit positional encoding, delegating positional information to the KDA layers and enabling efficient long-context training [34][39].

Group 4: Experimental Results
- Kimi Linear achieved the highest average scores on long-context benchmarks, demonstrating its effectiveness on extensive sequences [52][53].
- In reinforcement learning settings, it improved faster and further than MLA, particularly on mathematical reasoning tasks [56][57].
- Efficiency stays high: latency overhead versus GDN-H is negligible during prefill, while the speed advantage grows with sequence length [59][60].
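For the channel-wise decay idea in Group 3, the sketch below extends the plain delta-rule loop shown earlier in this digest with a per-channel forgetting vector applied to the memory before each write. It is one plausible illustrative reading of "fine-grained memory decay", not the exact KDA recurrence, gate parameterization, or the chunked Tensor-Core kernel described in the paper.

```python
import numpy as np

def channelwise_gated_delta_scan(q, k, v, alpha, beta):
    """Toy sequential sketch: a per-channel decay vector alpha_t in (0, 1)^{d_k}
    selectively forgets parts of the memory before a delta-rule write, instead
    of the single scalar decay used by coarser gated variants. Illustrative
    only; shapes and the exact update order are assumptions."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty_like(v)
    for t in range(len(k)):
        S = alpha[t][:, None] * S                              # channel-wise forgetting
        S = S + beta[t] * np.outer(k[t], v[t] - S.T @ k[t])    # delta-rule write
        out[t] = S.T @ q[t]
    return out

# Toy usage
rng = np.random.default_rng(2)
T, d_k, d_v = 8, 4, 4
o = channelwise_gated_delta_scan(rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)),
                                 rng.normal(size=(T, d_v)),
                                 rng.uniform(0.9, 1.0, size=(T, d_k)),   # per-channel decay
                                 rng.uniform(0, 1, size=T))
print(o.shape)  # (8, 4)
```

The per-channel gate is what distinguishes this family from scalar-gated variants such as GDN: different memory channels can be decayed at different rates, which is the "fine-grained" control the article refers to.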