Core Insights
- The article highlights the significant gap between open-source multimodal models and top closed-source models such as GPT-4o and Gemini, attributing it primarily to a shortage of high-quality reasoning data [2]
- OpenDataLab's MMFineReason framework aims to close this gap with a comprehensive, fully open-source multimodal reasoning data synthesis pipeline [2][10]

Data Challenges
- Existing open-source multimodal data is dominated by simple visual question answering (VQA) over natural images, while high-value reasoning data such as STEM charts and complex visual symbols remains scarce [6]
- Available reasoning data is also of inconsistent quality, often featuring short reasoning traces and insufficiently granular annotations [6]

Performance Results
- MMFineReason-4B, trained on Qwen3-VL-4B, surpasses Qwen3-VL-8B-Thinking and approaches the 30B-parameter Qwen3-VL-30B-A3B-Thinking [5]
- MMFineReason-8B outperforms both Qwen3-VL-30B-A3B-Thinking and Gemini-2.5-Flash, a leap attributable to data quality rather than model architecture [8]

Data Production Pipeline
- MMFineReason uses a fully open-source, transparent data production pipeline whose three main stages are designed to ensure high-quality data generation [12] (a hypothetical sketch of such a pipeline follows this summary)
- The final releases are MMFineReason-1.8M, MMFineReason-586K, and MMFineReason-123K, each curated for a different level of reasoning difficulty [14]

Dataset Characteristics
- MMFineReason's reasoning chains average 2,910 tokens, significantly longer than those of comparable datasets, which strengthens the resulting models' reasoning [16] (the statistics sketch below shows how such figures are computed)
- The data emphasizes high-difficulty logical reasoning: 79.4% mathematics, 13.8% science, and 4.6% puzzles and games [19]

Conclusion and Future Outlook
- Open-sourcing MMFineReason demonstrates that in the multimodal field, the key to improving model performance lies in data quality rather than model size [23]
- The project is now available on Hugging Face and GitHub with comprehensive support for the open-source community [23] (a dataset-loading sketch appears at the end of this digest)
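The summary says the pipeline has three main stages but does not name them. Below is a minimal Python sketch of one plausible shape for such a pipeline, assuming stages of seed filtering, teacher-model synthesis of long reasoning chains, and answer verification with difficulty bucketing; every name here (`Sample`, `stage1_filter`, the length thresholds) is a hypothetical stand-in, not MMFineReason's actual code.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One multimodal reasoning sample: image reference, question, answer."""
    image: str
    question: str
    answer: str
    reasoning: str = ""          # long chain-of-thought, filled in by stage 2
    difficulty: str = "unknown"  # assigned in stage 3

def stage1_filter(seeds: list[Sample]) -> list[Sample]:
    """Stage 1 (hypothetical): keep reasoning-heavy seeds, drop trivial VQA."""
    return [s for s in seeds if len(s.question) > 20]

def stage2_synthesize(sample: Sample) -> Sample:
    """Stage 2 (hypothetical): a strong teacher model writes a long
    step-by-step reasoning chain; a placeholder string stands in here."""
    sample.reasoning = f"Step-by-step analysis of: {sample.question}"
    return sample

def stage3_verify(sample: Sample) -> Sample | None:
    """Stage 3 (hypothetical): reject unverifiable answers and bucket the
    survivors by difficulty, e.g. via reasoning-chain length."""
    if not sample.answer.strip():
        return None
    sample.difficulty = "hard" if len(sample.reasoning.split()) > 500 else "easy"
    return sample

def run_pipeline(seeds: list[Sample]) -> list[Sample]:
    """Chain the three stages; only verified samples survive."""
    kept = []
    for s in stage1_filter(seeds):
        verified = stage3_verify(stage2_synthesize(s))
        if verified is not None:
            kept.append(verified)
    return kept
```

The staged design matters for the difficulty-tiered releases: the same verified pool can be re-bucketed into subsets like the 1.8M, 586K, and 123K splits without regenerating any reasoning chains.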
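As a sketch of how statistics like the 2,910-token average chain length and the category percentages would be computed: whitespace tokens below stand in for the dataset's actual (unspecified) tokenizer, and the field names `reasoning` and `category` are assumptions, not confirmed schema.

```python
from collections import Counter

def dataset_stats(samples: list[dict]) -> dict:
    """Mean reasoning-chain token count and per-category sample share.
    Whitespace splitting is a proxy for the real (unspecified) tokenizer,
    and the 'reasoning'/'category' field names are assumptions."""
    n = len(samples)
    lengths = [len(s["reasoning"].split()) for s in samples]
    shares = Counter(s["category"] for s in samples)
    return {
        "avg_chain_tokens": sum(lengths) / n,
        "category_share_pct": {k: round(100 * v / n, 1) for k, v in shares.items()},
    }

# On the full corpus this would report figures like
# {"avg_chain_tokens": 2910.0,
#  "category_share_pct": {"math": 79.4, "science": 13.8, "puzzles": 4.6, ...}}
```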
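Since the release is hosted on Hugging Face, the standard `datasets` API should apply. The repository id below is an illustrative guess assembled from the dataset names in this summary, not a confirmed identifier; consult the project's Hugging Face and GitHub pages for the real paths.

```python
# pip install datasets
from datasets import load_dataset

# Repo id is a guess from the names in this summary, NOT a confirmed path.
ds = load_dataset("OpenDataLab/MMFineReason-123K", split="train")

print(ds)            # features and row count
print(ds[0].keys())  # fields of a single sample
```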
A "wall-breaking" moment for open-source multimodal reasoning: MMFineReason helps 4B rival 30B
机器之心·2026-02-13 05:08