Linear Attention
Kimi Linear First Author Yu Zhang: Some Reflections on Model Training
自动驾驶之心· 2025-11-06 00:04
Author | yzhangcs@Zhihu  Editor | 青稞AI  Original link: https://www.zhihu.com/question/1967345030881584585/answer/1967730385816385407
I've finally finished the Kimi Linear model card and the arXiv upload of the paper, and took half a day to unwind. Here are a few personal reflections, along with some clarifications.
Paper: https://github.com/MoonshotAI/Kimi-Linear/blob/master/tech_report.pdf  Code: https://github.com/Moonshot
Model Architecture
The overall architecture (shown in the figure of the original post) follows Moonlight's design approach, and other answers have already given plenty of excellent breakdowns. The biggest difference this time is that we set the MoE sparsity much more aggressively, from 8 to 32. As for Kimi Linear's core design principles, the first is to mainly adopt Linear Attenti ...
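The MoE sparsity change mentioned above can be pictured with a toy top-k routing layer. Here "sparsity" is read as the ratio of total experts to experts activated per token (an assumption; the excerpt does not define it), and the sketch below is a generic mixture-of-experts layer, not Moonshot's implementation; the class name `TopKMoE` and all sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer illustrating the sparsity knob
    (num_experts / top_k). Illustrative only, not Kimi Linear's code."""

    def __init__(self, dim: int, num_experts: int, top_k: int, hidden: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); each token is routed to its top-k experts only
        scores = self.router(x)                            # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # (tokens, top_k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Same number of active experts per token, but a 4x larger expert pool:
# the ratio of total to active experts goes from 64/8 = 8 to 256/8 = 32.
moe_sparsity_8  = TopKMoE(dim=1024, num_experts=64,  top_k=8, hidden=2048)
moe_sparsity_32 = TopKMoE(dim=1024, num_experts=256, top_k=8, hidden=2048)
```

Raising the sparsity this way grows total parameters (and capacity) while keeping per-token compute roughly constant, which is the usual motivation for a more aggressive setting.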
Meng Weikang of Harbin Institute of Technology: Giving Attention Its "Sharp Edges" | Attention
36Kr· 2025-10-20 07:58
Core Insights
- The article discusses the evolution and challenges of Linear Attention in the context of Vision Transformers, highlighting the need for improved efficiency and performance in AI models [1][2][3].
Group 1: Linear Attention Challenges
- Linear Attention faces two main issues: the distribution of attention weights becomes too flat, reducing model sharpness, and the use of non-negative kernel functions discards negative interaction information [2][9].
- The traditional Self-Attention mechanism has high computational costs and energy consumption, making it difficult for smaller teams and companies to compete [1][2].
Group 2: PolaFormer Innovation
- PolaFormer introduces a dual-stream architecture that separates positive and negative interactions, allowing these relationships to be processed independently (a simplified sketch follows this list) [4][6][10].
- The model employs a learnable channel-wise power function to sharpen attention distributions, aiming to recover the expressiveness of Softmax Attention while maintaining efficiency [6][10][20].
Group 3: Experimental Validation
- Extensive experiments demonstrate that PolaFormer can replace Self-Attention in Vision Transformer frameworks, showing significant performance improvements across tasks such as object detection, semantic segmentation, and long-sequence benchmarks [7][31].
- The design maintains stable performance across different input types, including short texts and long sequences, without losing global information [9][29].
Group 4: Future Applications and Implications
- PolaFormer is expected to benefit long-sequence and high-resolution scenarios, such as video processing and large language models, by providing a more efficient solution without compromising performance [31][32].
- The research emphasizes the importance of co-designing algorithms with hardware to address deployment challenges, particularly in resource-constrained environments [30][31].
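Based only on the description above, the core ideas (splitting query-key interactions by sign into two streams and re-sharpening the kernel with a learnable channel-wise power) can be sketched as follows. This is a simplified illustration, not the official PolaFormer code: the clamped exponent and the way the two streams are recombined (a plain subtraction) are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class PolarityAwareLinearAttention(nn.Module):
    """Simplified sketch of polarity-aware linear attention: same-sign and
    opposite-sign query-key interactions are kept in separate streams, and a
    learnable channel-wise power sharpens the (non-negative) feature maps."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # hypothetical learnable exponent per channel, initialised at 1.0
        self.power = nn.Parameter(torch.ones(dim))

    def _sharpen(self, x: torch.Tensor) -> torch.Tensor:
        # exponents > 1 make the feature maps (and hence the weights) peakier
        return x.pow(torch.clamp(self.power, min=1.0))

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k: (batch, tokens, dim); v: (batch, tokens, dim_v)
        q_pos, q_neg = self._sharpen(q.clamp(min=0)), self._sharpen((-q).clamp(min=0))
        k_pos, k_neg = self._sharpen(k.clamp(min=0)), self._sharpen((-k).clamp(min=0))

        def stream(qf, kf):
            # linear-attention update: contract keys with values first,
            # so cost stays linear in the number of tokens
            kv = torch.einsum("bnd,bne->bde", kf, v)
            num = torch.einsum("bnd,bde->bne", qf, kv)
            den = torch.einsum("bnd,bd->bn", qf, kf.sum(dim=1)) + self.eps
            return num / den.unsqueeze(-1)

        same = stream(q_pos, k_pos) + stream(q_neg, k_neg)        # positive q·k terms
        opposite = stream(q_pos, k_neg) + stream(q_neg, k_pos)    # negative q·k terms
        return same - opposite
```

Keeping the opposite-sign stream is what restores the negative interactions that a single non-negative kernel would silently drop, while the key-value contraction preserves linear complexity in sequence length.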
Xiaomi Xiao AI: High-Performance On-Device Inference for Large Models Under Resource Constraints
AI前线· 2025-06-25 04:15
Core Insights
- The article discusses the challenges and advances in deploying large models on edge devices, emphasizing the need for optimization across architecture, systems, and algorithms to meet the demands of mobile, automotive, and IoT applications [1][3][4].
Group 1: Engineering Challenges
- Edge devices face significant resource limitations in computing power and bandwidth compared to cloud environments, necessitating low-bit quantization of models for deployment [3][4].
- The rapid evolution of large models complicates commercial deployment, as updates and improvements can lag on edge devices due to user-driven update mechanisms [4][5].
- Large models on the edge are still in a "technology accumulation" phase, with future deployment contingent on advances in edge computing capability and model stability [4][14].
Group 2: Performance Optimization
- The team developed an in-house inference framework achieving over 180 tokens/s in real-time inference, using strategies such as dynamic input support and speculative decoding to boost performance [1][6][7].
- Techniques such as low-bit quantization and instruction-level optimization are employed to maximize efficiency on resource-constrained devices [7][12].
- The framework supports a shared-base-model architecture, allowing multiple business applications to use a single model while preserving per-task quality through LoRA modules (see the sketch after this list) [10][11].
Group 3: Future Directions
- Future breakthroughs in edge model deployment are expected to hinge on hardware advances and the evolution of model architectures, such as Linear Attention, which could ease resource constraints [14][16][17].
- Next-generation chips designed for large models are anticipated to significantly enhance the capabilities of edge devices [15][17].
- Exploring new model architectures that reduce memory usage while maintaining performance is crucial, especially for applications requiring long context inputs [16][17].
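The shared-base-model point above can be pictured as one frozen base network plus per-application LoRA adapters selected at request time. The sketch below is illustrative only and does not describe Xiaomi's actual framework; the class `LoRALinear` and the adapter names are hypothetical.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen, shared base linear layer with named low-rank adapters.
    Each 'business application' picks its own adapter at runtime, so one
    copy of the base weights serves every application."""

    def __init__(self, base: nn.Linear, adapter_names, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # base weights stay frozen and shared
        self.scale = alpha / rank
        self.adapters = nn.ModuleDict({
            name: nn.ModuleDict({
                "down": nn.Linear(base.in_features, rank, bias=False),
                "up": nn.Linear(rank, base.out_features, bias=False),
            })
            for name in adapter_names
        })
        for adapter in self.adapters.values():
            nn.init.zeros_(adapter["up"].weight)   # each adapter starts as a zero delta

    def forward(self, x: torch.Tensor, adapter: str) -> torch.Tensor:
        a = self.adapters[adapter]
        return self.base(x) + self.scale * a["up"](a["down"](x))

# One shared projection, two task-specific adapters (hypothetical names).
layer = LoRALinear(nn.Linear(1024, 1024), adapter_names=["voice_assistant", "car_cabin"])
x = torch.randn(2, 1024)
y_assistant = layer(x, adapter="voice_assistant")
y_cabin = layer(x, adapter="car_cabin")
```

Only the small adapter matrices differ per application, so device memory grows by the adapter size rather than by a full model copy, which is the appeal of this setup under tight edge constraints.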