MiniCPM4
Second-generation InfLLM open-sourced: 3x faster at the same model size, zero extra parameters, trainable sparse attention
36Kr · 2025-10-09 12:12
Core Insights
- InfLLM-V2 is an efficient, trainable sparse attention mechanism that adapts to long texts with minimal training data, achieving performance close to traditional dense attention [1][2]
- The model switches seamlessly between short- and long-text processing modes, significantly improving efficiency and quality on long-context tasks [1][2]
- InfLLM-V2 delivers roughly a fourfold speedup over dense attention while retaining 98.1% of dense performance on long-text understanding tasks and 99.7% on deep reasoning tasks [1][2]

Summary by Sections

Model Advantages
- Low-cost training: only 5 billion long-text tokens are needed to acquire sparse attention capability, reducing training cost and shortening the adaptation cycle [2]
- Seamless switching from dense to sparse attention without adding parameters, staying aligned with mainstream training paradigms for stability and faster convergence [2]
- Efficient operator implementation: hardware-friendly designs optimize the time bottleneck of sparse attention, significantly reducing HBM I/O and computational overhead [2]

Technical Mechanism
- InfLLM-V2 replaces the dense attention paradigm, in which every query interacts with all keys, with a sparse scheme that restricts each query to a selected subset of keys, cutting computational cost [3][4]
- The mechanism uses a two-step process: block selection to determine the relevant key-value subsets, followed by sparse attention computed only over the selected subsets (a minimal sketch follows this summary) [4][6]

Performance Evaluation
- On long-text understanding tasks, InfLLM-V2 matches dense attention models, while other sparse attention methods show performance degradation [9]
- On deep reasoning tasks, InfLLM-V2 is comparable to dense attention, whereas NSA-style methods hurt model effectiveness [11]
- Efficiency tests show 4-9x operator-level acceleration over dense attention, with significant improvements in both the prefill and decode phases [13][17]

Future Developments
- The company plans to continue optimizing InfLLM-V2's training and inference operators and to integrate it into mainstream inference frameworks [20]
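To make the two-step mechanism concrete, here is a minimal Python sketch of block selection followed by attention computed only over the selected key-value blocks. The block scoring by mean key, the block size, and the top-k value are illustrative assumptions; the actual InfLLM-V2 kernels are hardware-friendly fused operators, not this toy loop.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size=64, top_k_blocks=4):
    """Toy two-step scheme: (1) score key-value blocks against the query,
    (2) run softmax attention only over the selected blocks."""
    seq_len, d = k.shape
    n_blocks = int(np.ceil(seq_len / block_size))

    # Step 1: block selection -- score each KV block by the query's
    # similarity to the block's mean key (a simple stand-in for the
    # block representations a real implementation would use).
    block_scores = np.empty(n_blocks)
    for b in range(n_blocks):
        blk = k[b * block_size:(b + 1) * block_size]
        block_scores[b] = q @ blk.mean(axis=0)
    selected = np.argsort(block_scores)[-top_k_blocks:]

    # Step 2: sparse attention -- gather only the selected blocks and
    # compute standard scaled dot-product attention over that subset.
    idx = np.concatenate([np.arange(b * block_size,
                                    min((b + 1) * block_size, seq_len))
                          for b in sorted(selected)])
    k_sel, v_sel = k[idx], v[idx]
    scores = (q @ k_sel.T) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_sel

# Usage: one query attending over an 8k-token KV cache with 128-d heads.
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
k = rng.standard_normal((8192, 128))
v = rng.standard_normal((8192, 128))
out = block_sparse_attention(q, k, v)   # shape (128,)
```

The summary above attributes the reported 4-9x operator speedups to hardware-friendly, fused versions of exactly these two steps, which reduce HBM I/O rather than looping over blocks as this sketch does.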
0.5B model punches above its weight to claim a new edge-side SOTA: runs on a 4090, 5x regular speedup for long-text processing | Open-sourced by Tsinghua & ModelBest
QbitAI (量子位) · 2025-06-10 07:35
Contributed by Tsinghua University & ModelBest | QbitAI (量子位). A king of cost-performance on the edge: the Tsinghua University and ModelBest team has open-sourced a new model, MiniCPM4, available in 8B and 0.5B parameter sizes, reaching the best performance in its class with only 22% of the training cost of comparable open-source models. MiniCPM4-8B is the first open-source native sparse-attention model; with an extreme 5% sparsity level, it makes long-text processing and deep reasoning genuinely runnable on edge devices. On benchmarks such as MMLU, CEval, MATH500, and HumanEval, it matches Qwen-3-8B and surpasses Gemma-3-12B with only 22% of the training cost. MiniCPM4-0.5B likewise punches above its weight: on MMLU, CEval, BBH, HumanEval, and other benchmarks it outperforms the same-class Qwen-3-0.6B, Llama 3.2, and Gemma3, and through native QAT it achieves nearly lossless int4 quantization and inference speeds of 600 tokens/s. On common edge chips such as the Jetson AGX Orin and RTX 4090, MiniCPM4 delivers a 5x regular speedup for long-text processing and up to 100x acceleration in extreme scenarios. The team has now publicly released the technical report; the ...
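For readers unfamiliar with how quantization-aware training can keep int4 quantization nearly lossless, below is a minimal Python sketch of the quantize-dequantize ("fake quant") step such training inserts into the forward pass while the full-precision weights are kept for gradient updates. The group size, symmetric scaling, and function name are illustrative assumptions, not MiniCPM4's published QAT recipe.

```python
import numpy as np

def fake_quant_int4(w, group_size=128):
    """Simulated int4 quantization (quantize then dequantize) with
    group-wise symmetric scales -- the basic building block of QAT:
    the forward pass sees int4-rounded weights, so the model learns
    to tolerate the rounding error."""
    w_flat = w.reshape(-1, group_size)
    scale = np.abs(w_flat).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    q = np.clip(np.round(w_flat / scale), -8, 7)              # integer codes
    return (q * scale).reshape(w.shape)                       # dequantized view

w = np.random.default_rng(1).standard_normal((256, 512)).astype(np.float32)
w_q = fake_quant_int4(w)
print("mean abs quantization error:", np.abs(w - w_q).mean())
```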
Core Insights
- MiniCPM4, developed by Tsinghua University and ModelBest, is an open-source model that reaches optimal performance in its class with only 22% of the training cost of comparable models, offered in 8B and 0.5B parameter sizes [1][3][4]
- The model uses a novel sparse attention mechanism, InfLLM v2, which enables efficient long-context processing at a 5% sparsity level [2][8][16]
- MiniCPM4 outperforms models such as Qwen-3 and Gemma-3 on benchmark tests while using significantly less training data [3][11][116]

Model Performance
- MiniCPM4-8B matches Qwen-3-8B and surpasses Gemma-3-12B with only 22% of the training data used by Qwen-3 [3][116]
- MiniCPM4-0.5B outperforms Qwen-3-0.6B and Llama 3.2 across benchmark tests, showcasing its efficiency at smaller parameter sizes [3][11]
- The model reaches a decoding speed of 600 tokens per second with minimal performance loss under quantization [3][10]

Technical Innovations
- The InfLLM v2 architecture enables efficient long-context processing by dynamically selecting relevant context tokens, reducing computational cost by 60% compared to previous methods [8][11][16]
- The model ships with a lightweight CUDA inference framework (CPM.cu) and a cross-platform deployment framework (ArkInfer) to optimize performance on edge devices [19][20][40]
- The FR-Spec algorithm improves speculative sampling efficiency, reducing computational overhead by 75% while maintaining output accuracy (a toy sketch of the idea appears after this summary) [28][30]

Data Efficiency
- MiniCPM4 achieves high capability density by training on only 8 trillion tokens, versus the 36 trillion used by Qwen-3, demonstrating effective data filtering strategies [56][116]
- The UltraClean data selection method improves the quality of pre-training data, significantly boosting model performance [57][61]

Application and Use Cases
- MiniCPM4 is designed for long-document understanding and generation, proving effective in tasks such as automated literature review generation and complex tool interactions [120][130]
- Its ability to handle long sequences and maintain high accuracy under context extrapolation makes it suitable for a range of AI-driven applications [118][119]
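As a rough illustration of the speculative-sampling idea behind FR-Spec, the sketch below restricts a draft step to a frequency-ranked subset of the vocabulary so that the draft's LM-head and softmax cost shrinks; the full model would then verify the proposed tokens as in standard speculative decoding. The 25% subset fraction, the greedy proposal, and the function names are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def draft_next_token(logits, frequent_ids):
    """Draft proposal restricted to a frequency-ranked vocabulary subset:
    only the 'hot' token ids are scored, shrinking the LM-head/softmax work."""
    sub_logits = logits[frequent_ids]                 # LM head restricted to frequent tokens
    return int(frequent_ids[np.argmax(sub_logits)])   # greedy proposal over the subset

vocab_size = 32000
rng = np.random.default_rng(2)
token_counts = rng.integers(1, 1_000_000, size=vocab_size)          # stand-in corpus frequencies
frequent_ids = np.argsort(token_counts)[::-1][: vocab_size // 4]    # keep the top 25% of tokens

logits = rng.standard_normal(vocab_size)   # stand-in draft-model logits for one position
print("draft proposes token id:", draft_next_token(logits, frequent_ids))
```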