MMLongBench
First benchmark for multimodal long-context understanding: none of 46 models cracks the 128K challenge
量子位 · 2025-05-23 06:14
Core Viewpoint

The article introduces MMLongBench, a comprehensive benchmark for evaluating the long-context understanding capabilities of multimodal models, developed by researchers from institutions including the Hong Kong University of Science and Technology and Tencent AI Lab [1][2].

Group 1: Benchmark Overview

- MMLongBench assesses the performance of long-context vision-language models (LCVLMs) across five challenging tasks, using 13,331 long-context samples drawn from 16 datasets [2][8].
- The five tasks are Visual RAG, Needle-in-a-Haystack, Many-Shot In-Context Learning, Summarization, and Long-Document VQA, covering a wide variety of image types and contexts [2][8].
- The benchmark controls cross-modal length by computing context length from image patches together with text tokens, standardizing inputs to 8K, 16K, 32K, 64K, and 128K tokens (a sketch of this accounting appears at the end of this article) [3][11].

Group 2: Model Evaluation

- A total of 46 leading multimodal large language models, both closed-source and open-source, were benchmarked, revealing significant challenges in long-context vision-language tasks [5][12].
- All models struggle: even top performers such as InternVL3-38B and Qwen2.5-VL-72B average below 50 points at the 128K input length [14].
- Error analysis identified OCR capability and cross-modal retrieval as the key bottlenecks for LCVLMs when processing long inputs [7][19].

Group 3: Findings and Insights

- Models with stronger reasoning capabilities tend to perform better on long-context tasks, with reasoning-enhanced models showing notable gains on summarization and DocVQA [15].
- Single-task performance does not reliably reflect a model's overall long-context understanding, underscoring the need for a diverse set of downstream tasks in evaluation [17].
- Error analysis of the Long-Document VQA and Visual RAG tasks confirmed that OCR capability remains a significant limitation for LCVLMs handling long document inputs [19][21].
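The cross-modal length control mentioned in Group 1 can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not MMLongBench's actual code: the helper names are hypothetical, and patch_size=14 is a common ViT setting assumed here rather than taken from the paper; the benchmark's real counting rules may differ.

```python
# Minimal sketch of cross-modal length accounting (hypothetical helper
# names; patch_size=14 is an assumed ViT setting, not from the paper).

STANDARD_LENGTHS = [8_192, 16_384, 32_768, 65_536, 131_072]  # 8K ... 128K

def image_token_count(width: int, height: int, patch_size: int = 14) -> int:
    """Approximate an image's token cost as its number of ViT patches."""
    return (width // patch_size) * (height // patch_size)

def context_length(text_tokens: int, images: list[tuple[int, int]]) -> int:
    """Total context = text tokens + patch tokens across all images."""
    return text_tokens + sum(image_token_count(w, h) for w, h in images)

def standardize(length: int) -> int | None:
    """Map a sample to the smallest standard input length that fits it."""
    for target in STANDARD_LENGTHS:
        if length <= target:
            return target
    return None  # longer than 128K: outside the benchmark's range

# Example: ~50K text tokens plus two 896x896 images (4,096 patches each)
# give 58,192 combined tokens, which falls into the 64K bucket.
sample = context_length(50_000, [(896, 896), (896, 896)])
print(sample, standardize(sample))  # 58192 65536
```

Counting image patches as tokens lets text-heavy and image-heavy samples be compared on one axis, which is what makes the five standardized input lengths meaningful across modalities.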