Sparse Attention Mechanisms
DeepSeek's New Model Cuts Prices: Inference Efficiency Optimized, API Prices Down Over 50%
YOUNG财经 漾财经· 2025-09-30 06:25
Core Insights
- DeepSeek has launched the new DeepSeek-V3.2-Exp model, which cuts API costs by over 50% [2][3][4]

Group 1: Model Release and Features
- DeepSeek-V3.2-Exp is an experimental version built on the previous V3.1-Terminus, introducing the DeepSeek Sparse Attention mechanism to improve training and inference efficiency on long texts [3][4]
- Despite the new sparse attention mechanism, the model maintains performance comparable to V3.1-Terminus across public evaluation datasets [4]

Group 2: Cost Reduction and Pricing
- The new model substantially lowers service costs, with API prices dropping by more than 50%. Specifically, input with cache hits falls from 0.5 yuan to 0.2 yuan per million tokens, input with cache misses from 4 yuan to 2 yuan per million tokens, and output from 12 yuan to 3 yuan per million tokens (a worked cost example follows this summary) [4]

Group 3: Research and Development
- Developing DeepSeek-V3.2-Exp involved designing new GPU operators and using the TileLang programming language for rapid prototyping, which supports deeper exploration of model capabilities [4]
- DeepSeek's research on the DeepSeek-R1 model, which incentivizes reasoning capabilities in large language models through reinforcement learning, was featured on the cover of the prestigious journal Nature [7]
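To make the new price list concrete, here is a minimal Python sketch comparing a workload under the old and new per-million-token prices quoted above. The token mix (10M cache-hit input, 5M cache-miss input, 8M output) is a hypothetical example, not a figure from the article.

```python
# Old vs. new DeepSeek API pricing, in yuan per million tokens (from the article).
OLD = {"cache_hit": 0.5, "cache_miss": 4.0, "output": 12.0}
NEW = {"cache_hit": 0.2, "cache_miss": 2.0, "output": 3.0}

def total_cost(prices: dict, hit_m: float, miss_m: float, out_m: float) -> float:
    """Total cost in yuan; token counts are given in millions."""
    return (prices["cache_hit"] * hit_m
            + prices["cache_miss"] * miss_m
            + prices["output"] * out_m)

# Hypothetical workload: 10M cache-hit input, 5M cache-miss input, 8M output.
old = total_cost(OLD, 10, 5, 8)  # 0.5*10 + 4*5 + 12*8 = 121 yuan
new = total_cost(NEW, 10, 5, 8)  # 0.2*10 + 2*5 + 3*8  = 36 yuan
print(f"saving: {1 - new / old:.0%}")  # -> saving: 70%
```

Because the output price falls by 75%, output-heavy workloads save well over the headline 50%.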
DeepSeek's New Version API Prices Cut; Cambricon: On the New DeepSeek Model…
Zhong Guo Zheng Quan Bao· 2025-09-30 00:09
Core Insights
- DeepSeek has released the experimental DeepSeek-V3.2-Exp, which introduces Sparse Attention for improved training and inference efficiency on long texts [1][2]
- The new model has brought a significant reduction in service costs, with API prices dropping by over 50% for developers [1]
- Cambricon has quickly adapted to the new model and open-sourced the vLLM-MLU inference engine, letting developers try the new features immediately [1][2]
- Huawei Ascend has also achieved 0-day support for DeepSeek-V3.2-Exp, optimizing deployment on the CANN platform while keeping inference latency low [3]

Group 1
- DeepSeek-V3.2-Exp introduces Sparse Attention for enhanced efficiency [1]
- API costs for developers have been reduced by over 50% [1]
- Cambricon has achieved day-0 adaptation for the new model [2]

Group 2
- Huawei Ascend has completed the adaptation and optimization for DeepSeek-V3.2-Exp [3]
- The deployment utilizes DeepSeek's large-scale expert parallelism (EP) scheme [3]
- On long sequences, latency is kept below 2 seconds for TTFT (time to first token) and 30 milliseconds for TPOT (time per output token); a back-of-envelope sketch follows this summary [3]
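To show what those bounds imply end to end, here is a back-of-envelope Python sketch using the TTFT and TPOT upper bounds quoted above. The 1,000-token output length is a hypothetical workload for illustration.

```python
# Worst-case streaming latency from the quoted Ascend bounds.
TTFT_S = 2.0    # time to first token, seconds (upper bound on long sequences)
TPOT_S = 0.030  # time per output token, seconds (upper bound)

def worst_case_latency_s(output_tokens: int) -> float:
    """Seconds to finish streaming `output_tokens` tokens after the request."""
    return TTFT_S + output_tokens * TPOT_S

print(worst_case_latency_s(1000))  # 2.0 + 1000 * 0.030 = 32.0 seconds
```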
DeepSeek's Latest Model Goes Live; Its New Attention Mechanism Builds on the Peking University ACL Best Paper
36Ke· 2025-09-29 23:39
Core Insights
- DeepSeek has launched its latest experimental model, DeepSeek-V3.2-Exp, featuring a new attention mechanism called DeepSeek Sparse Attention (DSA), which improves training and inference efficiency while cutting API costs by over 50% [1][19]

Model Features
- The V3.2 model builds on DeepSeek-V3.1-Terminus and introduces DSA, achieving faster, more efficient training and inference on long contexts [3][5]
- DSA is the first key technology branded under "DeepSeek" and improves on the Native Sparse Attention (NSA) from an earlier collaboration with Peking University [3][5]
- DSA lets the model focus on a small subset of important tokens rather than all tokens, reducing computational complexity from O(L²) to O(Lk), where k is much smaller than L (see the sketch after this summary) [8][10]

Performance Evaluation
- Evaluation results indicate that DeepSeek-V3.2-Exp maintains performance comparable to its predecessor, with no significant decline on either short- or long-text tasks [14][15]
- Benchmark results show some metrics slightly down and others up, indicating balanced performance across tasks [15]

Cost Efficiency
- The introduction of DSA has substantially reduced operating costs, and the API price has been lowered by over 50% for developers [19]
- Deployments of the model have demonstrated significant end-to-end acceleration and inference cost savings [18]

Future Implications
- Although still experimental, DeepSeek-V3.2-Exp presents a promising engineering path for handling long texts without sacrificing performance [18]
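The key idea behind DSA is easiest to see in code. The sketch below is an illustrative top-k sparse attention in Python/PyTorch, not DeepSeek's implementation: it scores all keys densely for clarity, whereas a production kernel would use a lightweight indexer and gather only the k selected keys, which is what yields the O(Lk) cost.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k):
    """Each query attends only to its top_k highest-scoring past tokens."""
    L, d = q.shape
    scores = (q @ k.T) / d ** 0.5                    # (L, L) selection scores
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    # Keep only the top_k entries per row; everything else stays masked out.
    vals, idx = scores.topk(min(top_k, L), dim=-1)
    sparse = torch.full_like(scores, float("-inf"))
    sparse.scatter_(-1, idx, vals)
    return F.softmax(sparse, dim=-1) @ v             # (L, d)

q = k = v = torch.randn(16, 8)
print(topk_sparse_attention(q, k, v, top_k=4).shape)  # torch.Size([16, 8])
```

With k fixed at a few thousand tokens while L grows into the hundreds of thousands, attending to k tokens per query instead of L is where the efficiency gain comes from.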
DeepSeek-V3.2-Exp Released; API Costs to Drop by More Than 50%
Feng Huang Wang· 2025-09-29 14:07
Core Insights
- DeepSeek has released the V3.2-Exp model, which introduces a Sparse Attention mechanism aimed at optimizing training and inference efficiency for long texts [1]
- The official app, web version, and mini-program have all been updated to DeepSeek-V3.2-Exp, and the API has seen a significant price cut [1]
- Under the new pricing policy, the cost for developers to call the DeepSeek API falls by over 50% [1]
- On public evaluation datasets, DeepSeek-V3.2-Exp performs comparably to V3.1-Terminus [1]
DeepSeek Officially Releases the DeepSeek-V3.2-Exp Model
Bei Jing Shang Bao· 2025-09-29 12:58
Core Insights
- DeepSeek officially released the DeepSeek-V3.2-Exp model, which introduces a Sparse Attention mechanism aimed at optimizing training and inference efficiency for long texts [1]
- The official app, web version, and mini-program have all been updated to DeepSeek-V3.2-Exp, reflecting the latest enhancements [1]
- DeepSeek API prices have been cut significantly, with costs for developers decreasing by over 50% [1]
DeepSeek-V3.2-Exp Model Released and Open-Sourced, API Prices Cut Sharply
36Ke· 2025-09-29 12:12
Core Insights
- The DeepSeek-V3.2-Exp model has been officially released and open-sourced, with significant updates in architecture and efficiency [1][4]
- The introduction of DeepSeek Sparse Attention (DSA) aims to improve training and inference efficiency on long texts without compromising output quality [1][5]
- Because the new model lowers service costs, API prices for developers have been cut by over 50% [4]

Group 1: Model Features
- DeepSeek-V3.2-Exp is an experimental version built on V3.1-Terminus that incorporates a sparse attention mechanism [1]
- The model achieves fine-grained sparse attention, significantly improving long-text training and inference efficiency [1]
- The new model's performance is comparable to V3.1-Terminus across public evaluation datasets [5]

Group 2: Development and Implementation
- Building the new model required designing and implementing many new GPU operators, with TileLang used for rapid prototyping [2]
- The open-sourced operators include both TileLang and CUDA versions, and the community is advised to use the TileLang version for easier debugging [2]

Group 3: Previous Versions and Improvements
- DeepSeek-V3.1, released on August 21, featured a hybrid inference architecture and improved efficiency over DeepSeek-R1-0528 [4]
- The September 22 update to DeepSeek-V3.1-Terminus addressed user feedback, improving language consistency and agent capabilities [4]
Just Now: DeepSeek Open-Sources V3.2-Exp and Unveils the New Sparse Attention Mechanism DSA
机器之心· 2025-09-29 10:29
Core Viewpoint
- DeepSeek has released the experimental DeepSeek-V3.2-Exp, which introduces a new sparse attention mechanism aimed at optimizing training and inference efficiency in long-context scenarios [3][5][10]

Summary by Sections

Model Release
- DeepSeek-V3.2-Exp has been open-sourced with a parameter count of 685 billion [3]
- The release includes a paper detailing the new sparse attention mechanism [5]

Sparse Attention Mechanism
- DeepSeek Sparse Attention (DSA) is the only architectural change in version 3.2, focused on improving computational efficiency when processing long text sequences [5][6][10]
- DSA achieves fine-grained sparse attention while keeping output quality nearly identical to its predecessor, DeepSeek-V3.1-Terminus [9]

Performance Comparison
- Benchmarks comparing DeepSeek-V3.1-Terminus and DeepSeek-V3.2-Exp show the new version performing comparably across tasks [11]
- Selected results (V3.1-Terminus vs. V3.2-Exp):
  - MMLU-Pro: 85.0 vs. 85.0
  - AIME 2025: 88.4 vs. 89.3
  - Codeforces: 2046 vs. 2121 [11]

Future Developments
- The upcoming release of Z.ai's GLM-4.6 model is also noted, GLM-4.5 being the previous flagship [12]
Generating Long Videos at Short-Video Cost: ByteDance Seed's New Attention Mechanism Cuts Compute by 85%
Sou Hu Cai Jing· 2025-09-02 05:45
Core Insights
- ByteDance's Seed team, in collaboration with Stanford researchers, has introduced a new model that reduces the computational cost of generating long videos by 85% while maintaining quality and coherence of characters and scenes [1][3]

Group 1: Technology Overview
- The model employs a sparse attention mechanism called Mixture of Contexts (MoC), which recasts long video generation as a context retrieval task [1][3]
- MoC generates a one-minute 480P video with only 2.32×10¹² FLOPs, versus 1.66×10¹³ FLOPs for the baseline model, an 85% reduction in computational load [3]
- MoC saves cost on shorter videos too: a multi-shot 64-second 480P video needs only 2.3×10¹² FLOPs, roughly 86% less than the baseline [3]

Group 2: Mechanism Details
- MoC's core mechanism segments cross-modal sequences into semantically homogeneous content blocks, improving retrieval accuracy and cutting unnecessary computation [4][6]
- The model uses dynamic top-k routing, retaining only the most relevant blocks for attention; this improves computational efficiency without adding parameters (a routing sketch follows this summary) [6][7]
- To prevent information leakage and keep long-range dynamics smooth, strict temporal masks prohibit queries from accessing their own or subsequent blocks [6][7]

Group 3: Performance Metrics
- MoC outperforms baseline models on metrics including theme consistency, background coherence, action continuity, and image quality [3][4]
- In a single-shot 8-second 320×192 video test, MoC required 4.1×10⁹ FLOPs, about 78% less than the baseline's 1.9×10¹⁰ FLOPs [3]

Group 4: Engineering Implementation
- MoC feeds the selected key-value pairs into FlashAttention variable-length kernels, enabling linear scalability to millions of tokens and efficient parallel processing on GPUs [6][7]
- The design ensures all visual tokens can access the complete text prompts, preserving thematic consistency and editability [7]
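The routing step is the heart of MoC. Below is a minimal Python/PyTorch sketch of block-wise top-k routing with a strict temporal mask, under stated assumptions: chunk descriptors are mean-pooled keys, and the mandatory anchors the article mentions (full text-prompt access, local context) are omitted for brevity, so queries in the first chunk have no eligible block here.

```python
import torch

def moc_route(q, k, chunk_size, top_k):
    """Return, for every query token, the indices of its top_k past chunks."""
    L, d = q.shape
    n_chunks = L // chunk_size
    # Descriptor per chunk: mean-pooled keys of that chunk (an assumption).
    chunk_keys = k[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d).mean(1)
    affinity = q @ chunk_keys.T                    # (L, n_chunks) routing scores
    # Strict temporal mask: no query may select its own or any later chunk.
    q_chunk = torch.arange(L) // chunk_size
    banned = torch.arange(n_chunks)[None, :] >= q_chunk[:, None]
    affinity = affinity.masked_fill(banned, float("-inf"))
    return affinity.topk(min(top_k, n_chunks), dim=-1).indices

q, k = torch.randn(32, 16), torch.randn(32, 16)
print(moc_route(q, k, chunk_size=8, top_k=2).shape)  # torch.Size([32, 2])
```

In a real pipeline, the tokens of the selected chunks would then be concatenated and fed to a variable-length FlashAttention kernel, as the article describes.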
Generating Long Videos at Short-Video Cost: ByteDance Seed's New Attention Mechanism Cuts Compute by 85%
量子位· 2025-09-02 04:17
Core Viewpoint
- The article discusses a new model developed by ByteDance's Seed team in collaboration with Stanford researchers that significantly reduces the computational cost of generating long videos while maintaining quality and coherence [1][2]

Group 1: Cost Reduction in Video Generation
- The model generates long videos at a cost comparable to short videos, an 85% reduction in computational requirements [1][10]
- For example, generating a one-minute 480P video with the Mixture of Contexts (MoC) mechanism takes only 2.32×10¹² FLOPs, versus 1.66×10¹³ FLOPs for the baseline model [10]
- MoC yields similar savings on short videos: a 64-second multi-shot video takes 2.3×10¹² FLOPs versus 1.7×10¹³ for the baseline, roughly 86% less (the arithmetic is checked after this summary) [11]

Group 2: Quality and Consistency
- The generated long videos maintain subject and background consistency, motion smoothness, and overall image quality, outperforming the baseline across performance metrics [12]
- In a single-shot 8-second 320×192 video test, MoC cut the computational load by about 78%, needing only 4.1×10⁹ FLOPs versus 1.9×10¹⁰ for the baseline [14]

Group 3: Mechanism of MoC
- MoC recasts long video generation as an information retrieval task centered on efficient cross-temporal memory retrieval [3][15]
- Its sparse attention mechanism segments video sequences into semantically homogeneous content blocks, letting each query token connect only to the most relevant blocks [15][16]
- A "content-aligned chunking" step improves retrieval accuracy and reduces wasted computation [19]

Group 4: Engineering Implementation
- MoC avoids information leakage by enforcing strict temporal masks during routing, ensuring queries cannot access future blocks [20]
- The implementation uses FlashAttention for efficient memory access and parallel processing on GPUs, scaling to millions of tokens [20]
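The quoted percentages follow directly from the FLOP counts. A quick Python check, using only figures from the article:

```python
# 1-minute 480P video
baseline, moc = 1.66e13, 2.32e12
print(f"{1 - moc / baseline:.0%}")  # -> 86%, in line with the ~85% headline

# 64-second multi-shot 480P video
baseline, moc = 1.7e13, 2.3e12
print(f"{1 - moc / baseline:.0%}")  # -> 86%

# single-shot 8-second 320x192 video
baseline, moc = 1.9e10, 4.1e9
print(f"{1 - moc / baseline:.0%}")  # -> 78%
```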
Will DeepSeek V4 "Take Off" on an Intern's Award-Winning Paper? Liang Wenfeng Targets Long Context: 10× Faster Processing and "Perfect" Accuracy
AI前线· 2025-07-31 05:02
Core Viewpoint
- The article highlights the achievements of Chinese authors in computational linguistics, focusing on DeepSeek's award-winning paper, which introduces a novel sparse attention mechanism for long-context modeling with efficiency and performance gains over traditional methods [1][17]

Group 1: Award and Recognition
- The ACL announced that over 51% of the 2025 award-winning papers had Chinese authors, versus 14% for the USA [1]
- A DeepSeek paper co-authored by Liang Wenfeng won a Best Paper award, generating considerable discussion [1]

Group 2: Technical Innovations
- The paper introduces Native Sparse Attention (NSA), a natively trainable sparse attention mechanism that combines algorithmic innovation with hardware-aligned optimization for efficient long-context modeling [4][6]
- NSA employs a dynamic hierarchical sparse strategy that balances global context awareness with local precision through token compression and token selection (a schematic sketch follows this summary) [11]

Group 3: Performance Evaluation
- NSA outperformed traditional full-attention models on 7 of 9 benchmark metrics, particularly on long-context tasks [8][10]
- In a "needle in a haystack" test with 64k context, NSA achieved perfect retrieval accuracy and significant speedups in decoding and training [9][15]

Group 4: Future Implications
- The upcoming DeepSeek model is expected to incorporate NSA technology, generating anticipation for its release [17]
- There is speculation that DeepSeek R2's release has been delayed because the founder is dissatisfied with its current performance [17]
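As background on the mechanism's shape, the NSA paper describes three parallel attention branches (compression, selection, and a sliding window) mixed by learned gates. The Python/PyTorch sketch below is a structural illustration only: the mean-pooling compressor, sequence-level block selection, and random gates are simplified stand-ins for the paper's learned components, and causal masking is omitted.

```python
import torch
import torch.nn.functional as F

def attend(q, k, v):
    """Plain scaled dot-product attention."""
    w = F.softmax((q @ k.T) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

def nsa_mix(q, k, v, block=8, top_blocks=2, window=16):
    L, d = k.shape
    n = L // block
    # 1) Compression branch: attend over coarse block summaries.
    kb = k[: n * block].reshape(n, block, d).mean(1)
    vb = v[: n * block].reshape(n, block, d).mean(1)
    cmp_out = attend(q, kb, vb)
    # 2) Selection branch: attend at full resolution inside top-scoring blocks.
    chosen = (q @ kb.T).mean(0).topk(min(top_blocks, n)).indices
    tok = torch.cat([torch.arange(b * block, (b + 1) * block)
                     for b in chosen.tolist()])
    sel_out = attend(q, k[tok], v[tok])
    # 3) Sliding-window branch: attend over the most recent tokens.
    win_out = attend(q, k[-window:], v[-window:])
    g = torch.softmax(torch.randn(3), dim=0)  # stand-in for learned gates
    return g[0] * cmp_out + g[1] * sel_out + g[2] * win_out

q, k, v = (torch.randn(32, 8) for _ in range(3))
print(nsa_mix(q, k, v).shape)  # torch.Size([32, 8])
```

The compression branch gives cheap global coverage, the selection branch restores exact detail where it matters, and the window branch preserves local precision, which is the global/local balance the summary describes.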