Unified OCR and Knowledge Learning Paradigm

8B Goes Head-to-Head with 72B! The MiniCPM-V 4.5 Technical Report Is Officially Out
量子位· 2025-09-23 11:01
Core Viewpoint
- The technical report for MiniCPM-V 4.5, billed as the industry's first multimodal model with high-refresh-rate video understanding, has been officially released, showcasing significant advances in video and document processing [1][2].

Group 1: Technical Innovations
- MiniCPM-V 4.5 introduces three key technologies: a unified 3D-Resampler architecture for high-density video compression, a unified OCR and knowledge learning paradigm for document processing, and a controllable hybrid fast/deep thinking scheme trained with multimodal reinforcement learning [2][8].
- The 3D-Resampler achieves a 96x compression rate for visual tokens, allowing the model to process more video frames without increasing compute cost; a toy sketch of the underlying token arithmetic follows this summary [11][12].
- The unified OCR and knowledge learning paradigm removes the reliance on external parsing tools, significantly reducing data noise and engineering complexity and yielding superior performance on document understanding tasks [25][24].

Group 2: Model Performance
- MiniCPM-V 4.5 drew wide acclaim upon its open-source release, ranking second on HuggingFace's trending list with over 220,000 downloads across major platforms [3][4].
- The model outperforms leading models such as GPT-4o-latest and Qwen2.5-VL-72B, achieving state-of-the-art (SOTA) results across a range of tasks with only 8 billion parameters [34][36].
- On the OpenCompass evaluation, MiniCPM-V 4.5 scored an average of 77.0, demonstrating visual-language capabilities superior to other models of its class [34][36].

Group 3: Efficiency and Cost Reduction
- The model's design significantly cuts training costs, reducing sampling expenses by 30% while maintaining high performance in both fast and deep thinking modes [29][30].
- The 3D-Resampler not only improves video processing efficiency but also enables seamless knowledge transfer between image and video tasks, further optimizing resource use [11][12][14].
- The hybrid reinforcement learning approach balances the quick responses needed in everyday scenarios with the depth required for complex tasks, improving overall model reliability [27][32].

Group 4: Community and Recognition
- The MiniCPM series, developed by Tsinghua University's NLP lab and ModelBest, has earned broad academic and industrial recognition, with over 13 million cumulative downloads and numerous accolades [49].
- Its contributions have been acknowledged in prestigious publications and forums, underscoring its impact on multimodal AI research [49].
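The 96x figure above is easiest to see as plain token arithmetic: if each 448x448 frame yields 1024 patch tokens from the vision encoder and the resampler pools six frames jointly into 64 query tokens, then 6 x 1024 / 64 = 96. The PyTorch snippet below is a minimal, illustrative sketch of such a resampler-style module, not the official MiniCPM-V 4.5 implementation; the frame size, patch counts, query count, and module names are assumptions chosen to match the commonly reported setup.

```python
# Toy sketch of resampler-style video token compression (illustrative only).
import torch
import torch.nn as nn

class Toy3DResampler(nn.Module):
    """A small pool of learnable queries cross-attends over the concatenated
    patch tokens of several video frames, producing a fixed-size output."""
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames * patches_per_frame, dim)
        b = frame_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)        # (batch, 64, dim)
        pooled, _ = self.attn(q, frame_tokens, frame_tokens)   # cross-attention pooling
        return pooled                                          # (batch, 64, dim)

# Assumed numbers: 6 frames x 1024 patch tokens each -> 64 output tokens (96x).
frames, patches_per_frame, dim = 6, 1024, 1024
x = torch.randn(2, frames * patches_per_frame, dim)
out = Toy3DResampler(dim)(x)
print(out.shape)                                                      # torch.Size([2, 64, 1024])
print(f"compression: {frames * patches_per_frame // out.size(1)}x")  # 96x
```

The design point this illustrates is that the output length is fixed by the number of learnable queries, so adding more input frames raises only the cross-attention cost inside the resampler while leaving the language model's visual token budget unchanged.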