Breaking the data quality divide! The Tsinghua-Tencent Bee project releases a 15-million-sample high-quality dataset, setting a new full-stack open-source MLLM SOTA

Core Insights
- The article covers the launch of the Bee project by Tsinghua University and Tencent's Hunyuan team, which aims to close the performance gap between fully open-source multimodal large language models (MLLMs) and their closed-source or semi-open counterparts [2][5][26].

Group 1: Background and Motivation
- The current MLLM landscape has a three-tier structure: (1) top-tier closed-source models (e.g., Gemini 2.5, GPT-5); (2) semi-open models trained on private data (e.g., Qwen2.5-VL); and (3) fully open-source models that lag significantly behind both [5].
- The core bottleneck is identified as a "data quality gap" rather than model architecture [2][10].

Group 2: Key Contributions of the Bee Project
- Honey-Data-15M: a high-quality SFT dataset of 15 million samples, enriched with a dual-layer Chain-of-Thought (CoT) scheme that attaches both short and long rationales (illustrated below) [6][16].
- HoneyPipe & DataStudio: an open-source, end-to-end data curation pipeline that provides a transparent, reproducible methodology for data cleaning and CoT augmentation [6][12].
- Bee-8B: an 8-billion-parameter model trained on Honey-Data-15M that achieves state-of-the-art (SOTA) results across benchmarks, rivaling or surpassing mainstream semi-open models [6][21][26].

Group 3: Data Quality Issues
- Existing open-source datasets suffer from two main problems: pervasive noise (e.g., factual errors, mismatched image-text pairs) and a shortage of complex reasoning data [11][14].
- The Bee project argues that the most viable path for the open-source community is to improve data quality rather than merely scale up data quantity [11][26].

Group 4: HoneyPipe Process
- HoneyPipe follows a meticulous "filter-enhance-validate" workflow that yields high-quality datasets [15][18].
- The workflow runs in three stages: noise and irrelevance filtering, short-CoT enhancement with validation, and long-CoT enhancement for complex queries [18]; a minimal control-flow sketch follows below.
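For intuition, Group 4's three stages can be pictured as a single routing loop over raw samples. The Python sketch below is a hypothetical reconstruction, not HoneyPipe's actual code: every function body is a rule-based stand-in for what the article describes as model-driven filtering, CoT generation, and validation, and all names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_id: str
    question: str
    answer: str
    short_cot: str | None = None  # layer 1: concise rationale
    long_cot: str | None = None   # layer 2: extended multi-step reasoning

def is_noisy(sample: Sample) -> bool:
    """Stage 1 stand-in: drop empty or malformed samples. The real filter
    is described as catching factual errors and image-text mismatches."""
    return not sample.question.strip() or not sample.answer.strip()

def add_short_cot(sample: Sample) -> Sample:
    """Stage 2 stand-in: attach a brief rationale (an MLLM would write this)."""
    sample.short_cot = f"Inspect the image, then conclude: {sample.answer}"
    return sample

def cot_is_valid(sample: Sample) -> bool:
    """Stage 2 validation stand-in: the rationale must reach the gold answer."""
    return sample.short_cot is not None and sample.answer in sample.short_cot

def is_complex(sample: Sample) -> bool:
    """Stage 3 routing stand-in: escalate reasoning-heavy queries only."""
    return any(kw in sample.question.lower()
               for kw in ("prove", "calculate", "how many", "why"))

def add_long_cot(sample: Sample) -> Sample:
    """Stage 3 stand-in: attach multi-step reasoning for complex queries."""
    sample.long_cot = (f"Step 1: parse the question. Step 2: gather visual "
                       f"evidence. Final answer: {sample.answer}")
    return sample

def honeypipe(raw: list[Sample]) -> list[Sample]:
    """Filter-enhance-validate loop mirroring the three stages in Group 4."""
    curated = []
    for s in raw:
        if is_noisy(s):          # Stage 1: noise and irrelevance filtering
            continue
        s = add_short_cot(s)     # Stage 2: short-CoT enhancement ...
        if not cot_is_valid(s):  # ... with validation; invalid samples dropped
            continue
        if is_complex(s):        # Stage 3: long-CoT for complex queries
            s = add_long_cot(s)
        curated.append(s)
    return curated
```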
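The dual-layer CoT from Group 2 then surfaces as two rationale fields on each curated sample. Below is a hypothetical record layout; the field names are illustrative guesses, not the released Honey-Data-15M schema.

```python
# Hypothetical Honey-Data-15M-style record; field names are invented for
# illustration and may not match the released schema.
record = {
    "image": "images/example.jpg",  # placeholder path
    "question": "How many red blocks are stacked on the table?",
    "short_cot": "Count the red blocks in the stack: there are 3.",
    "long_cot": (
        "Step 1: locate the table and the stack of blocks. "
        "Step 2: keep only the red ones. "
        "Step 3: count them one by one, giving 3."
    ),
    "answer": "3",
}
```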
Group 5: Performance of Bee-8B
- Bee-8B posts strong results on reasoning benchmarks, scoring 67.0 on MathVerse and 61.3 on LogicVista and outperforming semi-open models [28].
- On general VQA tasks, Bee-8B achieves SOTA scores on multiple benchmarks, including MMStar and CountBench [28].

Group 6: Conclusion
- The Bee project addresses the core data quality issues holding back fully open-source MLLMs, advocating a methodology that prioritizes data quality over sheer volume [26].