Core Insights
- MiniCPM4, an open-source model developed by Tsinghua University and ModelBest (面壁智能), achieves top-tier performance at only 22% of the training cost of comparable models, and is available in 8B and 0.5B parameter sizes [1][3][4]
- The model introduces InfLLM v2, a novel sparse attention mechanism that enables efficient long-context processing by attending to only about 5% of context tokens [2][8][16]
- MiniCPM4 outperforms models such as Qwen-3 and Gemma-3 on benchmarks while using significantly less training data [3][11][116]

Model Performance
- MiniCPM4-8B matches the performance of Qwen-3-8B and surpasses Gemma-3-12B while using only 22% of the training data consumed by Qwen-3 [3][116]
- MiniCPM4-0.5B outperforms Qwen-3-0.6B and Llama 3.2 across a range of benchmarks, demonstrating its efficiency at small parameter scales [3][11]
- The model reaches a decoding speed of 600 tokens per second, with minimal performance loss under quantization [3][10]

Technical Innovations
- The InfLLM v2 architecture processes long contexts efficiently by dynamically selecting the most relevant context tokens, cutting computational cost by 60% compared with previous methods [8][11][16]
- The model ships with a lightweight CUDA inference framework (CPM.cu) and a cross-platform deployment framework (ArkInfer) to optimize performance on edge devices [19][20][40]
- The FR-Spec algorithm makes speculative sampling more efficient, reducing computational overhead by 75% while preserving output accuracy [28][30]

Data Efficiency
- MiniCPM4 achieves high capability density by training on only 8 trillion tokens, versus the 36 trillion used by Qwen-3, demonstrating the effectiveness of its data filtering strategy [56][116]
- The UltraClean data selection method raises the quality of pre-training data, significantly improving model performance [57][61]

Applications and Use Cases
- MiniCPM4 is designed for long-document understanding and generation, proving effective in tasks such as automated literature-review generation and complex tool interactions [120][130]
- Its ability to handle long sequences while maintaining high accuracy in context extrapolation makes it suitable for a wide range of AI-driven tasks [118][119]
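To make the sparse-attention idea concrete: a mechanism like InfLLM v2 scores blocks of the key-value cache coarsely, keeps only the top few percent, and runs dense attention over just those tokens. The sketch below is a minimal, generic illustration of block-sparse attention with a ~5% keep ratio; the block-mean scoring, block size, and function names are assumptions for illustration, not MiniCPM4's actual kernel.

```python
import numpy as np

def block_sparse_attention(q, K, V, block_size=64, keep_ratio=0.05):
    """Toy block-sparse attention for a single query vector:
    score each key block coarsely, keep the top ~5% of blocks,
    then run dense softmax attention over the selected tokens only."""
    n, d = K.shape
    n_blocks = n // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Coarse relevance: query dotted with the mean key of each block.
    block_scores = Kb.mean(axis=1) @ q
    k = max(1, int(np.ceil(n_blocks * keep_ratio)))
    top = np.argsort(block_scores)[-k:]

    # Dense attention restricted to the selected blocks.
    K_sel = Kb[top].reshape(-1, d)
    V_sel = Vb[top].reshape(-1, d)
    logits = K_sel @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V_sel

rng = np.random.default_rng(0)
n, d = 4096, 32
out = block_sparse_attention(rng.normal(size=d),
                             rng.normal(size=(n, d)),
                             rng.normal(size=(n, d)))
print(out.shape)  # (32,)
```

With a 4096-token context and 5% keep ratio, the softmax runs over roughly 256 tokens instead of 4096, which is where the claimed long-context savings come from.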
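The FR-Spec savings come from a similar pruning idea applied to the vocabulary: the draft step of speculative sampling scores only a small, frequency-ranked subset of tokens, shrinking the LM-head and softmax cost, while verification can still use the full vocabulary. The sketch below illustrates that reduced-vocabulary draft step; the variable names and the subset size are illustrative assumptions, not MiniCPM4's actual implementation.

```python
import numpy as np

def draft_sample_fr(hidden, W_head, hf_ids, rng):
    """Draft-step sketch in the spirit of FR-Spec: compute logits and
    softmax only over a frequency-ranked token subset (hf_ids) rather
    than the full vocabulary, then sample a draft token from it."""
    logits = W_head[hf_ids] @ hidden          # score only the subset rows
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return hf_ids[rng.choice(len(hf_ids), p=p)]

rng = np.random.default_rng(0)
vocab, d = 32000, 64
W = rng.normal(size=(vocab, d))               # stand-in LM head weights
hf = np.arange(2000)                          # assumed high-frequency subset
tok = draft_sample_fr(rng.normal(size=d), W, hf, rng)
print(int(tok) in set(hf.tolist()))  # True
```

Here the draft matrix-vector product touches 2,000 rows instead of 32,000, which is the kind of overhead reduction the 75% figure refers to.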
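The data-efficiency claim rests on aggressive quality filtering of the pre-training corpus. A minimal sketch of classifier-based data selection, in the general spirit of UltraClean, is shown below; the scoring heuristic and threshold are toy stand-ins, since the actual pipeline details are not given in this summary.

```python
def filter_corpus(docs, score_fn, threshold=0.6):
    """Keep only documents whose quality score clears a threshold.
    `score_fn` stands in for a trained quality classifier."""
    return [d for d in docs if score_fn(d) >= threshold]

def toy_score(doc):
    # Toy heuristic: less repetitive text scores higher.
    words = doc.split()
    return len(set(words)) / max(len(words), 1)

docs = ["good varied informative text sample", "spam spam spam spam"]
kept = filter_corpus(docs, toy_score)
print(kept)  # ['good varied informative text sample']
```

In practice the scorer would be a learned model and the threshold tuned against downstream benchmarks, but the select-above-threshold structure is the same.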
A 0.5B model punches above its weight to set a new edge-side SOTA: runs on an RTX 4090, with 5x the conventional speed on long-text processing | Open-sourced by Tsinghua & ModelBest
量子位 (QbitAI) · 2025-06-10 07:35