Sparse Attention
DeepSeek's Minor Update Thrashes OpenAI and Catches Up with Gemini
36Kr· 2025-12-03 00:58
Core Insights
- DeepSeek has launched two new models, DeepSeek V3.2 and DeepSeek-V3.2-Speciale, which are designed to compete with leading models like GPT-5 and Gemini [1][5][20].

Model Performance
- DeepSeek V3.2 has shown competitive performance in various benchmarks, achieving scores close to or surpassing those of GPT-5 and Gemini in several tests [6][20].
- The model's performance in specific benchmarks includes:
  - AIME 2025: DeepSeek V3.2 scored 93.1, while DeepSeek-V3.2-Speciale scored 96.0 [6].
  - HMMT Feb 2025: DeepSeek V3.2 scored 92.5, and DeepSeek-V3.2-Speciale scored 99.2 [6].
- Overall, DeepSeek-V3.2-Speciale is noted for its ability to compete effectively with Gemini 3 [20][27].

Technological Innovations
- DeepSeek has implemented DeepSeek Sparse Attention (DSA) in its models, which allows for more efficient processing of longer texts by reducing computational complexity [9][13].
- The company has focused on enhancing post-training for open-source models, investing over 10% of total training compute to improve performance on challenging tasks [17][21].
- DeepSeek-V3.2-Speciale encourages longer reasoning without penalizing the model for extended thought processes, enhancing its ability to tackle complex problems [18][20].

Cost Efficiency
- Despite higher token consumption than competitors, DeepSeek offers a more cost-effective solution, with a significant price advantage over models like Gemini [32][33].
- For example, a response using 8077 tokens on DeepSeek costs approximately $0.0032, while one using 4972 tokens on Gemini costs around $0.06, roughly a 20-fold price difference [33] (a quick check of this arithmetic appears after this summary).

Industry Context
- The gap between open-source and closed-source models is reportedly widening, but DeepSeek is actively working to close it through innovative approaches and cost-saving measures [35][36].
- The company's strategy emphasizes algorithmic improvements over merely increasing computational power, aligning with industry views on the importance of efficient model training [38][39].
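As a sanity check on the cost figures quoted above, the snippet below reproduces the roughly 20-fold gap from the cited token counts and prices. The dollar amounts and token counts come from the article; the derived per-token rates are illustrative only, not official pricing.

```python
# Back-of-the-envelope check of the per-task cost figures cited above.
# Token counts and dollar amounts are taken from the article; the per-token
# rates derived here are illustrative, not official pricing.

deepseek_tokens, deepseek_cost = 8077, 0.0032   # USD
gemini_tokens, gemini_cost = 4972, 0.06         # USD

total_ratio = gemini_cost / deepseek_cost
per_token_ratio = (gemini_cost / gemini_tokens) / (deepseek_cost / deepseek_tokens)

print(f"Total-cost ratio (Gemini / DeepSeek): {total_ratio:.1f}x")      # ~18.8x
print(f"Per-token cost ratio:                 {per_token_ratio:.1f}x")  # ~30x
```

The total-cost ratio comes out at about 19x, consistent with the article's "20-fold" figure; on a per-token basis the gap is wider because the DeepSeek response also used more tokens.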
Attention Revisited: DeltaNet and New Improvements to Linear Attention Used by Alibaba and Kimi | LatePost Podcast
晚点LatePost· 2025-12-02 09:13
Core Insights
- The article discusses advancements in linear attention mechanisms, particularly DeltaNet, which aims to improve the efficiency and effectiveness of large language models (LLMs) by reducing the computational complexity associated with traditional attention mechanisms [5][10][12].

Group 1: Linear Attention Mechanisms
- Linear attention mechanisms, such as DeltaNet, were introduced to address the computational bottleneck of traditional attention mechanisms, which exhibit quadratic complexity with respect to input length [5][12].
- DeltaNet's development has been a collaborative effort, with significant contributions from researchers since its inception in 2021, focusing on improving the update rules and parallelization of linear attention [7][20][21] (a minimal sketch of the delta-rule update follows this summary).
- The recent open-source releases of the Qwen3-Next and Kimi Linear models by Alibaba and Kimi, respectively, incorporate linear attention mechanisms, indicating a shift towards these more efficient designs in flagship applications [5][24].

Group 2: DeltaNet and Its Evolution
- DeltaNet was initially overlooked due to a lack of key architectural improvements and suboptimal implementations, but recent advancements have led to its increased adoption in industry [20][24].
- The Gated DeltaNet variant enhances memory control and retrieval performance, making it more suitable for modern hardware [7][21][24].
- The relationship between DeltaNet and other models, such as Kimi Linear, highlights the trend of integrating linear attention with traditional full attention to balance speed and capacity [24][25].

Group 3: Future Directions and Challenges
- The article emphasizes the need for further exploration of update rules in linear attention mechanisms, suggesting that improvements in this area could lead to better performance and scalability [48][49].
- There is discussion of combining sparse attention with linear attention to address long-text processing, which remains a significant hurdle for current models [46][49].
- The ongoing industry debate over the effectiveness of linear versus full attention reflects the complexities and trade-offs involved in model design for various applications [27][30].
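For readers unfamiliar with the delta rule mentioned above, the following is a minimal sequential sketch of a gated delta-rule update over a fixed-size fast-weight memory. It is illustrative only: actual DeltaNet / Gated DeltaNet implementations use chunkwise-parallel algorithms, learned gates, and normalization details not shown here, and all shapes and names below are assumptions.

```python
import numpy as np

def gated_deltanet_recurrent(q, k, v, beta, alpha):
    """Sequential (non-parallel) sketch of a gated delta-rule update.

    q, k, v:  (T, d_k), (T, d_k), (T, d_v) per-token projections
    beta:     (T,) write strength in [0, 1]
    alpha:    (T,) forget gate in [0, 1]  (alpha = 1 recovers plain DeltaNet)
    Returns per-token outputs of shape (T, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))          # fast-weight memory, constant size in T
    outputs = np.empty((T, d_v))
    for t in range(T):
        S = alpha[t] * S              # gated decay of the memory
        pred = k[t] @ S               # value currently stored under key k_t
        S = S + beta[t] * np.outer(k[t], v[t] - pred)   # delta-rule correction
        outputs[t] = q[t] @ S         # read out with the query
    return outputs

# Tiny usage example with random projections
rng = np.random.default_rng(0)
T, d_k, d_v = 16, 8, 8
out = gated_deltanet_recurrent(
    rng.standard_normal((T, d_k)), rng.standard_normal((T, d_k)),
    rng.standard_normal((T, d_v)), rng.uniform(0, 1, T), rng.uniform(0.9, 1.0, T))
print(out.shape)   # (16, 8)
```

The key property is that the memory S has a fixed size regardless of sequence length, which is what gives linear (rather than quadratic) cost in T.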
DeepSeek's New Model Is Wild: The Whole AI Community Is Studying the Vision Route, and Karpathy Has Dropped the Act
机器之心· 2025-10-21 03:43
Core Insights
- The article discusses the groundbreaking release of the DeepSeek-OCR model, which compresses 1000 words into 100 visual tokens while maintaining a high accuracy of 97% [1].
- This model addresses the long-context efficiency issue in large language models (LLMs) and suggests a paradigm shift where visual inputs may be more effective than textual inputs [1][5].

Group 1: Model Features and Performance
- DeepSeek-OCR can process 200,000 pages of data daily using a single NVIDIA A100 GPU [1].
- The model's compression efficiency is ten times better than traditional text tokens, allowing for a significant reduction in the number of tokens needed to represent information [9] (a rough arithmetic illustration follows this summary).
- The model eliminates the need for tokenizers, which have been criticized for their complexity and inefficiency [6].

Group 2: Community Reception and Expert Opinions
- The open-source nature of DeepSeek-OCR has led to widespread validation and excitement within the AI community, with over 4000 stars on GitHub shortly after its release [1][2].
- Experts like Andrej Karpathy have praised the model, highlighting its potential to redefine how LLMs process inputs [3][5].
- The model has sparked discussions about the efficiency of visual tokens compared to text tokens, with some researchers noting that visual representations may offer better performance in certain contexts [9][11].

Group 3: Implications for Future Research
- The article suggests that the use of visual tokens could significantly expand the effective context length of models, potentially allowing for the integration of extensive internal documents into prompts [12][13].
- There are references to previous research that laid the groundwork for similar concepts, indicating that while DeepSeek-OCR is innovative, it is part of a broader trend in the field [18][20].
- The potential for combining DeepSeek-OCR with other recent advancements, such as sparse attention mechanisms, is highlighted as a promising avenue for future exploration [11][12].
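A rough arithmetic illustration of the compression claim above, using the article's roughly 10x figure; the vision-token budget below is an arbitrary assumption, not a DeepSeek-OCR parameter.

```python
# Rough arithmetic for the optical-compression claim above: if ~10 units of
# text can be represented by 1 vision token at ~97% decoding precision, a
# fixed vision-token budget covers roughly 10x more source text. The 10x
# figure is the article's; the budget below is an arbitrary example.

text_tokens_per_vision_token = 10      # ~10x compression cited for DeepSeek-OCR
vision_token_budget = 8192             # hypothetical context budget

effective_text_tokens = vision_token_budget * text_tokens_per_vision_token
print(f"{vision_token_budget} vision tokens ~ {effective_text_tokens} text tokens of content")
# -> 8192 vision tokens ~ 81920 text tokens of content
```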
AI Special Report: DeepSeek's Sparse Attention Mechanism Unlocks Greater Development Potential for the AI Industry
Zhongyuan Securities· 2025-10-16 11:46
Investment Rating
- The industry investment rating is "Outperform the Market", with an expected gain of over 10% relative to the CSI 300 index over the next six months [41].

Core Insights
- The report emphasizes that the introduction of sparse attention mechanisms, particularly through DeepSeek, significantly enhances the development potential of the AI industry [8][37].
- DeepSeek's advancements in attention mechanisms, including Native Sparse Attention (NSA) and DeepSeek Sparse Attention (DSA), are pivotal in improving model performance and efficiency [18][23][37].

Summary by Sections
1. Relationship Between Attention Mechanism and Large-Model Development
- The attention mechanism, introduced to improve information-processing efficiency, has become a core component of large models, addressing the limitations of traditional recurrent neural networks [11].
- Sparse attention reduces computational complexity from O(L²) to sub-quadratic levels, thus overcoming memory and computational bottlenecks [11] (a scaling comparison is sketched after this summary).

2. DeepSeek's Technological Improvements to the Attention Mechanism
- DeepSeek has made significant contributions in three main areas: Multi-head Latent Attention (MLA), Native Sparse Attention (NSA), and DeepSeek Sparse Attention (DSA) [12][18][23].
- MLA reduces memory usage by approximately 90% while maintaining model performance, significantly lowering training costs [16].
- NSA speeds up long-text processing by 11 times and achieves performance comparable to traditional models [18].
- DSA improves training and inference efficiency, leading to substantial cost reductions for model usage [23].

3. DSA and NSA Unlock Greater Development Potential for the AI Industry
- The integration of DSA and NSA allows for expanded model context and improved computational efficiency, which are crucial for meeting the demands of multi-modal applications [33][37].
- The trend towards longer input and output lengths necessitates innovative approaches to model training and performance enhancement [33].
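To make the O(L²)-to-sub-quadratic claim concrete, here is a rough scaling comparison that counts attention-score computations per head when each query is restricted to a fixed key budget. The budget of 2048 keys is an arbitrary assumption, and the count ignores constant factors and the cost of the selection step itself.

```python
# Rough scaling comparison behind the O(L^2) -> sub-quadratic claim above.
# Counts score computations per attention head; illustrative only.

def dense_scores(L):           # every query attends to every key
    return L * L

def sparse_scores(L, k=2048):  # every query attends to a fixed budget of k keys
    return L * min(L, k)

for L in (8_192, 65_536, 262_144):
    print(f"L={L:>7}: dense {dense_scores(L):.2e}  sparse {sparse_scores(L):.2e}  "
          f"ratio {dense_scores(L) / sparse_scores(L):.0f}x")
```

The gap grows linearly with context length, which is why sparse attention matters most for very long inputs.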
Second-Generation InfLLM Goes Open Source: 3x Faster at the Same Size, Zero Extra Parameters, Trainable Sparse Attention
36Kr· 2025-10-09 12:12
Core Insights
- InfLLM-V2 is an efficient sparse attention model designed to handle long texts with minimal data, achieving performance close to traditional dense models [1][2].
- The model allows seamless switching between short- and long-text processing modes, significantly enhancing efficiency and quality for long-context tasks [1][2].
- InfLLM-V2 demonstrates a fourfold speed improvement over dense attention mechanisms while maintaining 98.1% performance on long-text understanding tasks and 99.7% on deep reasoning tasks [1][2].

Summary by Sections
Model Advantages
- Low-cost training: only 5 billion long-text tokens are required to achieve sparse attention capability, resulting in reduced training costs and shorter adaptation cycles [2].
- The model supports seamless switching from dense to sparse attention without adding parameters, aligning with mainstream training paradigms for stability and faster convergence [2].
- Efficient operator implementation optimizes the time bottleneck in sparse attention through hardware-friendly designs, significantly reducing HBM I/O and computational overhead [2].

Technical Mechanism
- InfLLM-V2 replaces the dense attention paradigm, where each query interacts with all keys, with a sparse approach that limits interactions to a selected subset, thus reducing computational costs [3][4].
- The model introduces a two-step process: block selection to determine relevant key-value subsets, followed by sparse attention calculated only on the selected subsets [4][6] (a didactic sketch of this two-step pattern follows this summary).

Performance Evaluation
- On long-text understanding tasks, InfLLM-V2 matches the performance of dense attention models, while other sparse attention methods show performance degradation [9].
- On deep reasoning tasks, InfLLM-V2 achieves performance comparable to dense attention, while NSA-style methods negatively impact model effectiveness [11].
- Efficiency tests reveal that InfLLM-V2 can achieve 4-9x acceleration in operator speed compared to dense attention, with significant improvements in both the prefill and decode phases [13][17].

Future Developments
- The company plans to continue optimizing the training and inference operators of InfLLM-V2 and to integrate it into mainstream inference frameworks [20].
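The two-step pattern described above (block selection, then attention restricted to the selected key/value blocks) can be sketched for a single decoding query as follows. This is a didactic toy, not InfLLM-V2's actual operator: the block size, the number of selected blocks, and the mean-pooled block representation used for scoring are all assumptions.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size=64, top_blocks=4):
    """Two-step sketch: select key/value blocks, then attend only within them.

    q: (d,) a single decoding query;  k, v: (L, d) cached keys/values.
    """
    L, d = k.shape
    n_blocks = (L + block_size - 1) // block_size

    # Step 1: block selection -- score each block by its mean key (cheap proxy).
    block_reps = np.stack([k[i * block_size:(i + 1) * block_size].mean(axis=0)
                           for i in range(n_blocks)])
    block_scores = block_reps @ q
    chosen = np.argsort(block_scores)[-min(top_blocks, n_blocks):]

    # Step 2: softmax attention only over tokens in the selected blocks.
    idx = np.concatenate([np.arange(i * block_size, min((i + 1) * block_size, L))
                          for i in sorted(chosen)])
    scores = (k[idx] @ q) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v[idx]

rng = np.random.default_rng(0)
L, d = 4096, 64
out = block_sparse_attention(rng.standard_normal(d), rng.standard_normal((L, d)),
                             rng.standard_normal((L, d)))
print(out.shape)   # (64,)
```

Per query, the attention work scales with the number of selected tokens (top_blocks * block_size) instead of the full cache length L, which is where the reported speedups come from.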
Trillion-Dollar OpenAI, Skyrocketing Memory Prices, and the Freshly Released DeepSeek
傅里叶的猫· 2025-09-29 15:11
Group 1
- OpenAI is projected to become a trillion-dollar company, with significant investments in AI infrastructure and data centers [2][4][3].
- OpenAI plans to invest $1 trillion globally to build data centers to meet future demand for over 20 GW of computing power, with costs estimated at roughly $50 billion per GW [4][5].
- OpenAI's CEO emphasizes the massive energy and infrastructure requirements for next-generation AI, equating them to the power needs of over 13 million American households [3][4].

Group 2
- Rising prices for memory components, particularly DDR, are squeezing server businesses and leading to renegotiations of pricing with clients [6][10].
- Major manufacturers like Samsung and SK Hynix are reducing DDR4 production in favor of more profitable DDR5 and HBM memory, contributing to price increases [10].
- OpenAI's announcement of new AI data centers in the U.S. is expected to further drive demand for memory components, resulting in price hikes for DDR5 and NAND Flash [10][14].

Group 3
- The DeepSeek V3.2-Exp model introduces a sparse attention mechanism to improve computational efficiency, leading to a 50% reduction in API service costs [22][28].
- The model's performance remains comparable to previous versions, with some specific improvements on structured tasks, although there are noted regressions in certain areas [29][34].
- Several kernel implementations have been released for DeepSeek, each aiming to optimize performance for different use cases while balancing speed and complexity [31][32].
Counterintuitive: MoE Mixture-of-Experts Models Have Little to Do with Scenarios
理想TOP2· 2025-08-28 16:01
Core Viewpoint
- The MoE (Mixture of Experts) model is fundamentally a sparse computation mechanism aimed at improving efficiency, not a model in which each expert corresponds to a specific scenario [1][2].

Group 1: Scene Limitations
- Having multiple MoE sub-models does not mean each can only handle a specific scene; under the one-model paradigm it is impractical to train separate models for each scenario [1].
- If models were divided by scene, that would not be a true MoE structure [1].

Group 2: Uniform Distribution
- If only one type of scenario is run, a significant portion of the model's parameters may remain unused, leading to inefficiencies [2].
- It is more effective to distribute tasks evenly among experts than to assign specific experts to specific tasks, as low-usage experts may not justify their inclusion [2].

Group 3: Multiple Experts Activation
- The MoE model can activate multiple experts simultaneously, allowing for a more even distribution of computational resources and addressing more complex problems effectively [2] (a minimal routing sketch follows this summary).
- The essence of MoE is that only a small fraction of parameters significantly influences any given output, making it a sparse model that enhances computational efficiency [2].

Group 4: Understanding the Model
- Describing different experts as suited to specific scenarios is a simplification that aids understanding, but it does not reflect the intentional design of the model [3].
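A minimal sketch of top-k expert routing helps make the "sparsity, not scenarios" point concrete: the router picks a few experts per token, so only a small fraction of parameters is active, regardless of what scenario the token comes from. Shapes, the routing function, and the plain linear experts below are illustrative assumptions.

```python
import numpy as np

def moe_forward(x, w_gate, experts, top_k=2):
    """Minimal sketch of top-k expert routing in a Mixture-of-Experts layer.

    Only the top_k experts chosen by the router run for each token, so the
    active parameter count per token is a small fraction of the total.

    x:       (T, d) token representations
    w_gate:  (d, n_experts) router weights
    experts: list of n_experts weight matrices, each (d, d)
    """
    logits = x @ w_gate                                    # (T, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]          # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gate_logits = logits[t, top[t]]
        gates = np.exp(gate_logits - gate_logits.max())
        gates /= gates.sum()                               # softmax over selected experts
        for g, e in zip(gates, top[t]):
            out[t] += g * (x[t] @ experts[e])              # only top_k experts evaluated
    return out

rng = np.random.default_rng(0)
T, d, n_experts = 4, 16, 8
x = rng.standard_normal((T, d))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
print(moe_forward(x, rng.standard_normal((d, n_experts)), experts).shape)   # (4, 16)
```

Nothing in the routing ties an expert to a "scene": selection is learned per token, and training typically balances load across experts rather than specializing them by scenario.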
R2 Isn't Here Yet, but DeepSeek's Secret Weapon Has Already Been "Spoiled"
Hu Xiu· 2025-07-31 07:58
Core Insights
- ACL, the top conference in the field of natural language processing, awarded the best paper to a joint work by DeepSeek and Peking University titled "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" [4][3].
- This paper introduces a significant advancement in the efficiency of large language models, achieving up to 11 times faster inference while maintaining model performance [5][34].

Group 1: Technology and Innovation
- The paper presents a novel approach to sparse attention, moving from theoretical reasoning to a complete training process, which is crucial for the future of large models [5][26].
- The Native Sparse Attention (NSA) method mimics human reading strategies by compressing long texts, selecting relevant details, and maintaining a sliding window of recent context [26][30] (a toy sketch of this three-branch idea follows this summary).
- NSA is designed to be natively trainable, allowing the model to learn efficient attention distribution from the pre-training phase [32][51].

Group 2: Performance Metrics
- In various benchmark tests, the 27B model utilizing NSA outperformed traditional full-attention models on 7 out of 9 metrics, particularly excelling in reasoning tasks [35][37].
- The NSA method achieved 100% information-retrieval accuracy in long-text comprehension tasks, demonstrating its effectiveness in handling extensive data [38][40].
- Training speed improved significantly, with forward computation accelerated by 9 times and backward propagation by 6 times, while inference speed saw an impressive 11.6-times increase [44][45].

Group 3: Market Implications
- The advancements in NSA technology position DeepSeek as a potential leader in the AI application ecosystem, promising faster, more efficient, and cost-effective solutions for users [55][58].
- The ability to process extensive documents and datasets without manual segmentation could revolutionize how users interact with AI, enhancing productivity and accessibility [54][59].
- The competitive edge provided by NSA technology is expected to solidify DeepSeek's market position, transforming it from a price-driven player to a technology innovator [58][60].
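The three reading strategies attributed to NSA above (compressed summaries, selected fine-grained blocks, and a sliding window over recent context) can be illustrated for a single decode step as below. This is a toy sketch, not the paper's trainable, hardware-aligned kernel; in particular the gate weights here are fixed rather than learned, and the block size, selection count, and window length are arbitrary assumptions.

```python
import numpy as np

def softmax_attn(q, k, v):
    """Plain softmax attention for a single query q over keys k and values v."""
    s = (k @ q) / np.sqrt(len(q))
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ v

def three_branch_decode_step(q, k, v, block=64, top_blocks=4, window=128,
                             gates=(1 / 3, 1 / 3, 1 / 3)):
    """Single-query sketch combining coarse, selected, and local attention."""
    L, d = k.shape
    n_blocks = (L + block - 1) // block
    # Branch 1: coarse view -- attend over mean-pooled block summaries.
    k_c = np.stack([k[i * block:(i + 1) * block].mean(0) for i in range(n_blocks)])
    v_c = np.stack([v[i * block:(i + 1) * block].mean(0) for i in range(n_blocks)])
    out_cmp = softmax_attn(q, k_c, v_c)
    # Branch 2: fine view -- full attention inside the top-scoring blocks only.
    chosen = np.argsort(k_c @ q)[-min(top_blocks, n_blocks):]
    idx = np.concatenate([np.arange(i * block, min((i + 1) * block, L))
                          for i in sorted(chosen)])
    out_sel = softmax_attn(q, k[idx], v[idx])
    # Branch 3: local view -- sliding window over the most recent tokens.
    out_win = softmax_attn(q, k[-window:], v[-window:])
    g1, g2, g3 = gates
    return g1 * out_cmp + g2 * out_sel + g3 * out_win

rng = np.random.default_rng(0)
L, d = 4096, 64
print(three_branch_decode_step(rng.standard_normal(d), rng.standard_normal((L, d)),
                               rng.standard_normal((L, d))).shape)   # (64,)
```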
Zhihu Has Accumulated 8.58 Million AI-Related Questions and 20.88 Million Professional AI Answers | Spotlight on WAIC 2025
Guo Ji Jin Rong Bao· 2025-07-27 12:23
Core Insights
- The rise of AI developers has made Zhihu a primary platform for launching projects and discussing AI advancements, with significant engagement from the community [1][3][4].

Group 1: Community Engagement
- Zhihu has attracted 16 million continuous learners in the technology and AI fields, along with 3.56 million deep creators in these topics, accumulating 8.58 million AI-related questions and 20.88 million professional answers [1].
- Several AI companies have actively engaged on Zhihu, including DeepSeek's exclusive release of a technical article and the launch of the humanoid robot Lingxi X2 by Zhihu user "Zhihui Jun" [3].

Group 2: Events and Interactions
- During WAIC 2025, Zhihu showcased a multi-dimensional interactive exhibition highlighting AI technology discussions and engaging activities like "Knowledge King PK" [4].
- Zhihu organized a "Developer Recovery Night" event where numerous AI developers shared insights and experiences, emphasizing the transformative impact of large models on embodied intelligence technology [5].

Group 3: Collaborations and Publications
- Zhihu collaborated with 14 AI companies to release the "AI World Handbook," aiming to provide insights into the AI ecosystem [4].
3,700 Pre-Training Runs in Search of the "Linear Attention" Non-Consensus: MiniMax-01 Developer Recounts a 4-Year Exploration
晚点LatePost· 2025-03-09 12:00
"We're playing the second half, betting on future demand for long text."

In January this year, MiniMax released MiniMax-01, an open-source large model with 456 billion parameters that uses "Lightning Attention", the linear attention mechanism the company developed in-house.

We invited the project's lead, Zhong Yiran, Senior Research Director at MiniMax, to talk with us about how linear attention was developed. Zhong Yiran is responsible for large-model network architecture design at MiniMax and is currently building a multimodal deep reasoning model.

Zhong Yiran previously served as a young scientist at the Shanghai AI Laboratory, where he was the PI (principal investigator) of the new-architecture exploration group. He received his PhD from the Australian National University, supervised by Professor Hongdong Li and Academician Richard Hartley. He and his team have published more than 20 papers on new model architectures at top international conferences and journals, covering most of today's non-Transformer architectures, including linear attention mechanisms (Linear Attention), long convolution (Long Convolution), and linear recurrent networks (Linear RNN).

Back in 2021, when linear attention was still a "beautiful-looking bubble", Yiran and his team began exploring how to make linear architectures work.

Guest: Zhong Yiran
Edited by: Liu Qian, Cheng Manqi

In the previous podcast episode, we spoke with two Tsinghua PhD students, Xiao Chaojun and Fu ...
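As background for the linear-attention discussion, here is a minimal recurrent sketch of causal linear attention, the general family that Lightning Attention belongs to. It shows how a running state replaces the L×L score matrix of softmax attention; the feature map and normalization below are generic textbook choices, not MiniMax's actual Lightning Attention design.

```python
import numpy as np

def causal_linear_attention(q, k, v, eps=1e-6):
    """Recurrent form of causal linear attention with a positive feature map.

    Instead of the L x L score matrix of softmax attention, it maintains a
    running (d_k x d_v) state and a (d_k,) normalizer, so time is
    O(L * d_k * d_v) and memory is constant in sequence length.
    """
    def phi(x):                              # elu(x) + 1, a positive feature map
        return np.where(x > 0, x + 1.0, np.exp(x))

    q, k = phi(q), phi(k)
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))                 # running sum of outer(k_t, v_t)
    z = np.zeros(d_k)                        # running sum of k_t (normalizer)
    out = np.empty_like(v)
    for t in range(len(q)):
        S += np.outer(k[t], v[t])
        z += k[t]
        out[t] = (q[t] @ S) / (q[t] @ z + eps)
    return out

rng = np.random.default_rng(0)
L, d = 128, 32
print(causal_linear_attention(*(rng.standard_normal((L, d)) for _ in range(3))).shape)
# (128, 32)
```

The trade-off discussed in the podcast is that this constant-size state is also the bottleneck: everything the model remembers about the prefix must fit into S, which is why update rules (such as the delta rule) and hybrid designs with full attention remain active research topics.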