Sparse Attention Architecture
5x Faster Long-Text Inference! ModelBest Releases the MiniCPM4 Edge-Side Models, with the 0.5B Model Crushing Its Class
AI前线 · 2025-06-12 06:07
Core Viewpoint
- The newly released MiniCPM4.0 model series, offered at 8B and 0.5B parameter scales, significantly improves edge-side performance and adaptability across terminal scenarios [1][6]

Model Performance
- MiniCPM4.0-8B is the first natively sparse model, with 5% attention sparsity, matching Qwen-3-8B's performance at only 22% of the training cost [2][4]
- On benchmarks such as MMLU, CEval, and HumanEval, MiniCPM4.0-0.5B outperforms peers like Qwen-3-0.6B and Llama 3.2 while reaching an inference speed of 600 tokens/s [4][6]

Technological Innovations
- A new sparse attention architecture delivers a 5x speedup in long-text inference, rising to as much as 220x in memory-constrained scenarios [6][8]
- MiniCPM4.0 cuts the long-text KV cache to just 1/4 of what Qwen3-8B requires, and achieves a 90% reduction in model size while maintaining robust performance [8][10]

Model Architecture
- The InfLLMv2 sparse attention architecture lets each query efficiently "sample" the most relevant text segments, cutting attention computation by 90% compared with traditional dense models [14][15]
- A dual-frequency switching mechanism selects the appropriate attention mode for long versus short texts, improving both efficiency and accuracy; both ideas are sketched in the code after this summary [17]

Deployment and Adaptation
- MiniCPM4.0 has been adapted to major chip platforms including Intel, Qualcomm, and Huawei Ascend, and supports a range of open-source frameworks [10][24]
- The ArkInfer cross-platform deployment framework tackles chip fragmentation, providing a versatile solution for model deployment [25]

Data and Training Innovations
- A high-density data selection mechanism is used to construct high-quality datasets, cutting data validation costs by 90% [28][29]
- The training strategy incorporates techniques such as FP8 training and chunk-wise rollout to improve GPU utilization [30]
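Neither summary goes below a high level, but the two mechanisms named above, attending only to the most relevant KV blocks per query and switching between dense and sparse attention by context length, can be sketched compactly. The Python below is an illustrative reconstruction, not ModelBest's InfLLMv2 kernel; block_size, top_k, and switch_len are placeholder values.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=8):
    # Illustrative InfLLMv2-style sparse attention (not the real kernel):
    # score each KV block by a mean-pooled summary key, then let each
    # query attend only to its top-k highest-scoring blocks.
    n_kv, d = k.shape
    n_blocks = n_kv // block_size
    kb = k[: n_blocks * block_size].view(n_blocks, block_size, d)
    vb = v[: n_blocks * block_size].view(n_blocks, block_size, d)
    summaries = kb.mean(dim=1)                        # (n_blocks, d)
    scores = q @ summaries.T                          # (n_q, n_blocks)
    picked = scores.topk(min(top_k, n_blocks), dim=-1).indices
    out = torch.empty_like(q)
    for i in range(q.shape[0]):                       # per-query loop for clarity
        sk = kb[picked[i]].reshape(-1, d)             # selected keys
        sv = vb[picked[i]].reshape(-1, d)             # selected values
        w = F.softmax((q[i] @ sk.T) / d ** 0.5, dim=-1)
        out[i] = w @ sv
    return out

def attention(q, k, v, switch_len=8192):
    # "Dual-frequency switching" sketch: dense attention for short
    # contexts, block-sparse attention for long ones. The 8192-token
    # threshold is a placeholder, not a published value.
    if k.shape[0] <= switch_len:
        w = F.softmax((q @ k.T) / k.shape[-1] ** 0.5, dim=-1)
        return w @ v
    return block_sparse_attention(q, k, v)
```

With, say, top_k=8 blocks of 64 tokens over a 10K-token context, each query touches roughly 5% of the keys, which matches the sparsity level the coverage cites.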
ModelBest's "Little Steel Cannon" (MiniCPM) 4.0 Released: Performance on Par with Qwen-3-8B, Up to 220x Speedup
Xin Lang Ke Ji · 2025-06-10 09:37
Core Insights
- The fourth generation of the "MiniCPM" ("Little Steel Cannon") model, MiniCPM 4.0, has been released in two parameter scales, 8B and 0.5B, each achieving the best performance in its class [2][3]
- The MiniCPM 4.0-8B model uses a sparse attention mechanism and matches Qwen-3-8B's performance while requiring only 22% of the training cost [2][4]
- The model reaches an inference speed of 600 tokens/s, with up to 220x acceleration in extreme scenarios, significantly strengthening long-text processing [2][3]

Performance and Architecture
- MiniCPM 4.0 offers a 5x speedup in long-text inference over comparable models such as Qwen-3-8B and Llama-3-8B, rising to a maximum of 220x under memory-constrained conditions [3][4]
- The InfLLMv2 architecture lowers attention sparsity from the industry-standard 40%-50% to just 5%, so long-text computation requires only about 1/10 of the usual compute [4]
- In 128K long-text scenarios, MiniCPM 4.0-8B needs only 1/4 of the KV-cache storage that Qwen3-8B requires, reflecting substantial model compression and efficiency; a back-of-envelope check follows this summary [4]

Applications and Market Impact
- Building on the 8B version, the company has fine-tuned two capability-specific models: an MCP Client and MiniCPM4-Survey, a research tool that competes with Deep Research [5]
- The MiniCPM series has passed 10 million downloads across all platforms, indicating strong market interest and adoption [5]
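The 128K cache figure is easy to sanity-check with back-of-envelope arithmetic. The helper below computes KV-cache size from standard transformer dimensions; the configuration numbers are illustrative placeholders, not the published MiniCPM or Qwen3 configs, and the 4x saving is shown purely as arithmetic rather than as a claim about how it is achieved.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_value: int = 2) -> float:
    # K and V are each (n_layers, n_kv_heads, seq_len, head_dim);
    # bytes_per_value=2 assumes an fp16/bf16 cache.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 2**30

# Hypothetical 8B-class config at a 128K context:
full = kv_cache_gib(n_layers=36, n_kv_heads=8, head_dim=128, seq_len=128 * 1024)
print(f"full cache: {full:.1f} GiB; 1/4 of that: {full / 4:.1f} GiB")
# A 4x reduction could come from keeping ~1/4 of KV blocks resident,
# quantizing the cache, or some mix of both; the article does not say which.
```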
ModelBest Releases the MiniCPM4 Edge-Side Models: 5x Faster Long-Text Inference, and the 0.5B Model Sets a New SOTA
AI科技大本营 · 2025-06-10 09:31
Core Viewpoint
- The release of MiniCPM4.0 marks a significant advance in edge-side models, with innovations in performance, speed, and storage efficiency, particularly for long-text processing [1][4][32]

Group 1: Model Performance and Efficiency
- MiniCPM4.0-8B is the first natively sparse model, with 5% attention sparsity, achieving performance comparable to Qwen-3-8B while using only 22% of the training resources [2][5][6]
- MiniCPM4.0-0.5B delivers impressive results at just 2.7% of the training cost, outperforming larger models such as Qwen-3-0.6B and Llama 3.2 and reaching 600 tokens/s [2][5][9]
- The architecture enables a 5x speedup in long-text inference, up to 220x in extreme scenarios, addressing the industry's long-standing bottleneck of slow long-text processing [4][9][16]

Group 2: Technological Innovations
- The InfLLMv2 sparse attention architecture significantly cuts computational costs, enabling efficient long-text processing by lowering attention sparsity from 40%-50% to 5% [18][19][20]
- MiniCPM4.0 employs CPM.cu, a self-developed three-tier inference framework optimized for edge devices, which delivers the 5x speed gain [21][22]
- Advanced quantization techniques, including P-GPTQ and BitCPM, minimize compute and memory demands for efficient deployment; a generic sketch of the underlying idea follows this summary [23][24]

Group 3: Data and Training Efficiency
- The company emphasizes high-quality data, using innovative dataset-construction methods that cut validation costs by 90% [29][30]
- The training strategy incorporates the upgraded Model Wind Tunnel v2, optimizing hyperparameter configuration and improving GPU utilization [30][32]
- MiniCPM4.0's development reflects a commitment to maximizing returns on research investment through systematic improvements across data, training, and inference [28][32]

Group 4: Market Position and Future Directions
- MiniCPM4.0 has surpassed 10 million downloads across all platforms, indicating strong market acceptance and recognition [32]
- The company plans to continue raising model knowledge density and intelligence, driving efficient development and large-scale application of edge-side AI [32]
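P-GPTQ and BitCPM are named here without technical detail. As an explicitly generic stand-in, the sketch below shows the baseline idea that such post-training schemes refine: rounding weights to a few bits with one scale per small group. It is plain symmetric round-to-nearest quantization with an assumed group size, not either algorithm.

```python
import torch

def quantize_groupwise(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    # Generic post-training weight quantization: symmetric round-to-nearest
    # with one fp scale per group of `group_size` weights. P-GPTQ/BitCPM
    # are more sophisticated; this shows only the baseline idea.
    qmax = 2 ** (bits - 1) - 1
    flat = w.reshape(-1, group_size)                  # numel must divide evenly
    scale = flat.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(flat / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(256, 256)
q, s = quantize_groupwise(w)
err = (w - dequantize(q, s, w.shape)).abs().mean()
print(f"mean abs error after 4-bit round trip: {err:.4f}")
```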