Sparse Attention

Trillion-Dollar OpenAI, Soaring Memory Prices, and the Newly Released DeepSeek
傅里叶的猫· 2025-09-29 15:11
Group 1
- OpenAI is projected to become a trillion-dollar company, with significant investments in AI infrastructure and data centers [2][4][3]
- OpenAI plans to invest $1 trillion globally to build data centers to meet future demand for more than 20 GW of computing power, with costs estimated at $500 billion per GW [4][5]
- OpenAI's CEO emphasizes the massive energy and infrastructure requirements of next-generation AI, equating them to the power needs of more than 13 million American households [3][4]

Group 2
- Rising prices for memory components, particularly DDR, are impacting server businesses and leading to renegotiations of pricing with clients [6][10]
- Major manufacturers such as Samsung and SK Hynix are reducing DDR4 production in favor of more profitable DDR5 and HBM memory, contributing to the price increases [10]
- OpenAI's announcement of new AI data centers in the U.S. is expected to further drive demand for memory components, resulting in price hikes for DDR5 and NAND Flash [10][14]

Group 3
- The DeepSeek V3.2-Exp model introduces a sparse attention mechanism to improve computational efficiency, leading to a 50% reduction in API service costs [22][28] (see the sketch after this list)
- The model's performance remains comparable to previous versions, with specific improvements in structured tasks, although there are noted regressions in certain areas [29][34]
- Various kernel implementations for DeepSeek aim to optimize performance for different use cases, balancing speed and complexity [31][32]
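As a rough illustration of the sparse-attention idea referenced in Group 3, the sketch below keeps only the top-k highest-scoring keys for each query before the softmax. This is a generic NumPy toy, not DeepSeek's actual V3.2-Exp kernel; the function name, shapes, and the value of k are illustrative assumptions.

```python
# Minimal top-k sparse attention sketch (toy; not DeepSeek's implementation).
import numpy as np

def topk_sparse_attention(Q, K, V, k=8):
    """Q, K, V: (seq_len, d). Each query attends only to its k highest-scoring keys."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # full score matrix (toy; real kernels avoid this cost)
    # Threshold at each query's k-th largest score and mask everything below it.
    kth = np.partition(scores, -k, axis=-1)[:, -k]
    masked = np.where(scores >= kth[:, None], scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 128 tokens, 64-dim heads, each query attends to only 8 keys.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
print(topk_sparse_attention(Q, K, V, k=8).shape)       # (128, 64)
```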
Counterintuitive: MoE (Mixture-of-Experts) Models Have Little to Do with Specific Scenarios
理想TOP2· 2025-08-28 16:01
Core Viewpoint
- The MoE (Mixture of Experts) model is fundamentally a sparse activation mechanism aimed at improving computational efficiency, not a model in which each expert corresponds to a specific scenario [1][2]

Group 1: Scene Limitations
- Having multiple MoE sub-models does not mean each can only handle a specific scene; under the one-model paradigm it is impractical to train a separate model for every scenario [1]
- If models are divided by scene, the result is not a true MoE structure [1]

Group 2: Uniform Distribution
- If only one type of scenario is run, a significant portion of the model's parameters may remain unused, leading to inefficiency [2]
- It is more effective to distribute tasks evenly among experts than to assign specific experts to specific tasks, since low-usage experts may not justify their inclusion [2]

Group 3: Multiple Experts Activation
- An MoE model can activate multiple experts simultaneously, spreading computation more evenly and handling complex problems more effectively [2] (see the routing sketch after this list)
- The essence of MoE is that only a small fraction of parameters meaningfully influences any given output, making it a sparse model that improves computational efficiency [2]

Group 4: Understanding the Model
- Describing different experts as suited to specific scenarios is a simplification that aids understanding, but it does not reflect the model's actual design intent [3]
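To make the "multiple experts active per token, load spread across experts" point concrete, here is a minimal top-k routing sketch. It is a generic MoE toy in NumPy, not the implementation of any particular model; the router, the plain linear experts, and all shapes are assumptions for illustration.

```python
# Minimal top-k MoE routing sketch (toy; generic, not any specific model).
import numpy as np

def moe_forward(x, router_w, expert_ws, k=2):
    """x: (tokens, d); router_w: (d, n_experts); expert_ws: (n_experts, d, d)."""
    logits = x @ router_w                               # router scores, (tokens, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-k:]                # indices of the k best experts for token t
        gate = np.exp(logits[t, top] - logits[t, top].max())
        gate /= gate.sum()                              # softmax over the selected experts only
        for g, e in zip(gate, top):
            out[t] += g * (x[t] @ expert_ws[e])         # weighted sum of the active experts
    return out

# Toy usage: 16 tokens, 32-dim hidden, 8 experts, 2 active per token.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
router_w = rng.standard_normal((32, 8))
expert_ws = rng.standard_normal((8, 32, 32)) * 0.1
print(moe_forward(x, router_w, expert_ws, k=2).shape)   # (16, 32)
```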
R2 Hasn't Arrived Yet, but DeepSeek's Secret Weapon Has Already Been "Spoiled"
Hu Xiu· 2025-07-31 07:58
Core Insights
- The top conference in natural language processing, ACL, awarded its best paper to joint work by DeepSeek and Peking University titled "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" [4][3]
- The paper marks a significant advance in the efficiency of large language models, achieving up to 11 times faster inference while maintaining model performance [5][34]

Group 1: Technology and Innovation
- The paper presents a novel approach to sparse attention, moving from theoretical reasoning to a complete training process, which is crucial for the future of large models [5][26]
- The Native Sparse Attention (NSA) method mimics human reading strategies by compressing long texts, selecting relevant details, and maintaining a sliding window of recent context [26][30] (see the schematic sketch after this list)
- NSA is designed to be natively trainable, allowing the model to learn efficient attention distribution from the pre-training phase [32][51]

Group 2: Performance Metrics
- In benchmark tests, the 27B model using NSA outperformed traditional full-attention models on 7 of 9 metrics, excelling in particular on reasoning tasks [35][37]
- NSA achieved 100% information retrieval accuracy in long-text comprehension tasks, demonstrating its effectiveness on extensive inputs [38][40]
- Training speed improved significantly: forward computation accelerated by 9 times and backward propagation by 6 times, while inference speed increased by 11.6 times [44][45]

Group 3: Market Implications
- The advances in NSA position DeepSeek as a potential leader in the AI application ecosystem, promising faster, more efficient, and more cost-effective solutions for users [55][58]
- The ability to process extensive documents and datasets without manual segmentation could change how users interact with AI, enhancing productivity and accessibility [54][59]
- The competitive edge provided by NSA is expected to solidify DeepSeek's market position, transforming it from a price-driven player into a technology innovator [58][60]
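The three NSA branches described in Group 1 (compressed block summaries, a few selected full blocks, and a recent sliding window, mixed together) can be sketched schematically as below. This mirrors only the paper's high-level structure; the mean-pooled compression, fixed gates, and block-selection rule here are simplifications, not the hardware-aligned NSA kernels.

```python
# Schematic sketch of NSA-style three-branch attention (simplified, not the paper's kernels).
import numpy as np

def softmax_attn(q, K, V):
    """Single-query softmax attention: q (d,), K/V (n, d) -> (d,)."""
    s = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def nsa_like_step(q, K, V, block=16, n_select=2, window=32, gates=(1/3, 1/3, 1/3)):
    """One query against cached K, V of shape (seq_len, d), combining three branches."""
    d = K.shape[-1]
    n_blocks = K.shape[0] // block
    Kb = K[:n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[:n_blocks * block].reshape(n_blocks, block, d)
    # Branch 1: attend over compressed (mean-pooled) block summaries.
    K_cmp, V_cmp = Kb.mean(axis=1), Vb.mean(axis=1)
    out_cmp = softmax_attn(q, K_cmp, V_cmp)
    # Branch 2: attend over the few blocks whose summaries match the query best.
    top = np.argsort(K_cmp @ q)[-n_select:]
    out_sel = softmax_attn(q, Kb[top].reshape(-1, d), Vb[top].reshape(-1, d))
    # Branch 3: attend over a sliding window of the most recent tokens.
    out_win = softmax_attn(q, K[-window:], V[-window:])
    g1, g2, g3 = gates
    return g1 * out_cmp + g2 * out_sel + g3 * out_win

# Toy usage: one 64-dim query against a 256-token cache.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
q = rng.standard_normal(64)
print(nsa_like_step(q, K, V).shape)  # (64,)
```

In the paper the gates are learned per query rather than fixed, which is what makes the scheme "natively trainable"; the fixed weights here are purely for illustration.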
Zhihu Has Accumulated 8.58 Million AI-Related Questions and 20.88 Million Professional AI Answers | Focus on WAIC 2025
Guo Ji Jin Rong Bao· 2025-07-27 12:23
Core Insights
- The rise of AI developers has made Zhihu a primary platform for launching projects and discussing AI advances, with significant engagement from the community [1][3][4]

Group 1: Community Engagement
- Zhihu has attracted 16 million continuous learners in technology and AI, along with 3.56 million deep creators in these topics, accumulating 8.58 million AI-related questions and 20.88 million professional answers [1]
- Several AI companies have actively engaged on Zhihu, including DeepSeek's exclusive release of a technical article and the launch of the humanoid robot Lingxi X2 by Zhihu user "Zhihui Jun" [3]

Group 2: Events and Interactions
- During WAIC 2025, Zhihu presented a multi-dimensional interactive exhibition highlighting AI technology discussions and activities such as "Knowledge King PK" [4]
- Zhihu organized a "Developer Recovery Night" event at which numerous AI developers shared insights and experiences, emphasizing the transformative impact of large models on embodied intelligence technology [5]

Group 3: Collaborations and Publications
- Zhihu collaborated with 14 AI companies to release the "AI World Handbook," aiming to provide insight into the AI ecosystem [4]
3,700 Pre-Training Runs in Search of the "Linear Attention" Non-Consensus: A MiniMax-01 Developer Recounts a 4-Year Exploration
晚点LatePost· 2025-03-09 12:00
"We are playing the second half of the game; what we're betting on is the future demand for long text."

In January this year, MiniMax released MiniMax-01, an open-source large model with 456 billion parameters that uses the linear attention mechanism they developed, "Lightning Attention".

We invited the project's lead, Zhong Yiran (钟怡然), Senior Research Director at MiniMax, to talk with us about the development of linear attention. At MiniMax, Zhong Yiran is responsible for large-model network architecture design and is currently developing a multimodal deep reasoning model.

Zhong Yiran previously served as a young scientist at the Shanghai AI Laboratory, where he was the PI (project lead) of the new-architecture exploration group. He received his PhD from the Australian National University, advised by Professor Hongdong Li and academician Richard Hartley. He and his team have published more than 20 papers on new model architectures at top international conferences and in journals, covering several classes of current non-Transformer architectures, such as linear attention (Linear Attention), long convolution (Long Convolution), and linear recurrent networks (Linear RNN).

Back in 2021, when linear attention was still "a bubble that looked beautiful," Yiran and his team began exploring implementations of linear architectures.

Guest | Zhong Yiran
Compiled by | Liu Qian, Cheng Manqi

In the previous podcast episode, we spoke with two PhD students from Tsinghua, Xiao Chaojun and Fu ...
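For readers unfamiliar with the term: linear attention reorders the attention computation so that softmax(QK^T)V is approximated by phi(Q)(phi(K)^T V), turning the key-value product into a fixed d×d state and making the cost linear rather than quadratic in sequence length. The sketch below shows this generic trick with an ELU+1 feature map; it is not MiniMax's Lightning Attention implementation.

```python
# Minimal linear-attention sketch (generic kernel trick, not Lightning Attention).
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # ELU(x) + 1, keeps features positive

def linear_attention(Q, K, V):
    """Q, K, V: (seq_len, d). Cost is O(seq_len * d^2) instead of O(seq_len^2 * d)."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                                # (d, d) summary of all key-value pairs
    z = Kf.sum(axis=0)                           # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]

# Toy usage: 1024 tokens, 64-dim heads; the (d, d) state never grows with sequence length.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (1024, 64)
```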