面壁 MiniCPM4 Edge-Side Models
5x Faster Long-Text Inference! 面壁 (ModelBest) Releases the MiniCPM4 Edge-Side Models, with the 0.5B Model Crushing Its Class
AI前线· 2025-06-12 06:07
Core Viewpoint
- The newly released MiniCPM4.0 model series, available at 8B and 0.5B parameter scales, significantly improves edge-side performance and adaptability across terminal scenarios [1][6].

Model Performance
- MiniCPM4.0-8B is the first native sparse model, with 5% sparsity, and matches Qwen3-8B while using only 22% of the training cost [2][4].
- In benchmarks such as MMLU, CEval, and HumanEval, MiniCPM4.0-0.5B outperforms peers such as Qwen3-0.6B and Llama 3.2, reaching an inference speed of 600 tokens/s [4][6].

Technological Innovations
- The new context-sparse architecture delivers a 5x speedup in long-text inference, rising to as much as 220x in memory-constrained scenarios [6][8].
- MiniCPM4.0 cuts the long-text cache to just 1/4 of what Qwen3-8B requires, while also achieving a 90% reduction in model size with performance maintained [8][10] (see the cache-sizing sketch after this summary).

Model Architecture
- The InfLLMv2 sparse attention architecture efficiently "samples" the most relevant text segments, cutting attention computation by 90% compared with traditional dense models [14][15] (a toy block-selection sketch follows this summary).
- A dual-frequency switching mechanism picks the appropriate attention mode for long versus short texts, improving both efficiency and accuracy [17].

Deployment and Adaptation
- MiniCPM4.0 has been adapted for major chip platforms including Intel, Qualcomm, and Huawei Ascend, and supports a range of open-source inference frameworks [10][24].
- The ArkInfer cross-platform deployment framework addresses chip fragmentation and provides a unified path for deploying the model across backends [25] (a conceptual dispatch sketch appears after this summary).

Data and Training Innovations
- The company uses a high-density data selection mechanism to build high-quality training datasets, cutting validation costs by 90% [28][29] (a minimal filtering sketch follows this summary).
- The training strategy applies techniques such as FP8 low-precision training and chunk-wise rollout to improve GPU utilization [30].
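To make the cache claim concrete, the back-of-the-envelope calculation below estimates KV-cache memory for a long context and what a roughly 4x reduction would save. The layer count, head count, head dimension, and context length are placeholder assumptions for an 8B-class model, not MiniCPM4.0's published configuration.

```python
# Rough, illustrative KV-cache sizing for long-context inference.
# The layer/head counts below are placeholders, not MiniCPM4.0's real config;
# the point is only to show why a ~4x cache reduction matters on edge devices.

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Memory for keys + values across all layers (fp16 -> 2 bytes/element)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

if __name__ == "__main__":
    # Placeholder 8B-class configuration (assumption, not the published one).
    dense = kv_cache_bytes(seq_len=128_000, num_layers=36, num_kv_heads=8, head_dim=128)
    sparse = dense / 4  # the article's claim: ~1/4 of the dense-baseline cache
    print(f"dense KV cache : {dense / 2**30:.1f} GiB")
    print(f"~1/4 KV cache  : {sparse / 2**30:.1f} GiB")
```

With these assumed numbers the dense cache lands near 18 GiB for a 128K-token context, which is why shrinking it matters for on-device inference.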
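The toy NumPy sketch below illustrates the block-selection idea described for the attention architecture: score coarse blocks of the context against the query, keep only the most relevant few (roughly 5% here), and run full attention inside the kept blocks, with a length-based switch back to dense attention for short inputs. It is a conceptual illustration only; `block_size`, `keep_ratio`, and `switch_len` are assumed values, and this is not InfLLMv2's actual kernel or selection rule.

```python
# Toy sketch of block-sparse ("sampling") attention with a long/short switch.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, K, V):
    scores = q @ K.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def block_sparse_attention(q, K, V, block_size=64, keep_ratio=0.05):
    n = K.shape[0]
    n_blocks = int(np.ceil(n / block_size))
    # Coarse relevance score per block: the query against the block's mean key.
    block_means = np.stack([K[i * block_size:(i + 1) * block_size].mean(axis=0)
                            for i in range(n_blocks)])
    block_scores = block_means @ q
    keep = max(1, int(np.ceil(keep_ratio * n_blocks)))  # keep ~5% of blocks
    top_blocks = np.argsort(block_scores)[-keep:]
    idx = np.concatenate([np.arange(b * block_size, min((b + 1) * block_size, n))
                          for b in sorted(top_blocks)])
    # Full attention, but only over the tokens in the selected blocks.
    return dense_attention(q, K[idx], V[idx])

def attention(q, K, V, switch_len=8192):
    # Switch in spirit of the "dual-frequency" idea: dense attention for short
    # contexts, block-sparse attention once the context is long enough to pay off.
    if K.shape[0] <= switch_len:
        return dense_attention(q, K, V)
    return block_sparse_attention(q, K, V)
```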
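The deployment problem the article describes is one model artifact facing many chip-specific runtimes. The sketch below shows the generic adapter-and-registry pattern a cross-platform layer typically uses; every name in it (`InferenceBackend`, `CpuBackend`, `deploy`) is invented for illustration and is not ArkInfer's actual API.

```python
# Conceptual sketch of cross-platform backend dispatch (not ArkInfer's API).
from typing import Protocol

class InferenceBackend(Protocol):
    def load(self, model_path: str) -> None: ...
    def generate(self, prompt: str, max_new_tokens: int) -> str: ...

class CpuBackend:
    """Placeholder backend; a real one would wrap a vendor runtime."""
    def load(self, model_path: str) -> None:
        self.model_path = model_path
    def generate(self, prompt: str, max_new_tokens: int) -> str:
        return f"[echo from {self.model_path}] {prompt[:32]}..."

# Registry mapping a detected platform to the adapter that targets it.
BACKENDS: dict[str, type] = {
    "cpu": CpuBackend,
    # "qualcomm": QnnBackend,   # hypothetical vendor-specific adapters
    # "ascend": AscendBackend,
}

def deploy(model_path: str, platform: str = "cpu") -> InferenceBackend:
    backend = BACKENDS[platform]()  # pick the adapter for this chip/platform
    backend.load(model_path)
    return backend

if __name__ == "__main__":
    runner = deploy("minicpm4-0.5b.bin", platform="cpu")
    print(runner.generate("Hello", max_new_tokens=16))
```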
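Finally, a minimal sketch of classifier-driven data selection: score each candidate document with a quality model and keep only high-scoring ones. The scorer and threshold here are stand-ins, since the article does not describe the company's actual selection classifier.

```python
# Minimal sketch of quality-score-based data filtering (illustrative only).
from typing import Callable, Iterable, Iterator

def select_high_density(docs: Iterable[str],
                        score_fn: Callable[[str], float],
                        threshold: float = 0.8) -> Iterator[str]:
    """Yield documents whose quality score clears the (assumed) threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc

if __name__ == "__main__":
    # Trivial placeholder scorer (length heuristic) standing in for a real
    # quality classifier trained to spot information-dense text.
    corpus = ["short", "a much longer, information-dense paragraph ... " * 5]
    kept = list(select_high_density(corpus, score_fn=lambda d: min(len(d) / 200, 1.0)))
    print(len(kept), "of", len(corpus), "documents kept")
```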