A Comprehensive Evaluation of Multimodal Models' Video OCR Capabilities: Gemini Reaches Only 73.7% Accuracy
量子位 (QbitAI) · 2025-05-30 07:10
Core Viewpoint
- The article discusses the challenges and advances of Multimodal Large Language Models (MLLMs) in Optical Character Recognition (OCR) for dynamic video content, highlighting the need for better evaluation frameworks and stronger model capabilities in this area [1][2][5].
Group 1: Model Capabilities and Challenges
- MLLMs have shown excellent OCR capabilities on static images but face significant challenges when applied to dynamic video scenarios [1][2].
- The MME-VideoOCR framework aims to systematically evaluate and enhance MLLMs' perception, understanding, and reasoning abilities in video OCR [3].
- Current MLLM capabilities are limited by factors such as motion blur, lighting changes, and complex temporal associations in videos, all of which complicate text recognition [5][21].
Group 2: Data and Task Design
- MME-VideoOCR defines a detailed task system with 10 major task categories and 25 independent tasks, focusing on high-level capabilities such as temporal understanding and complex reasoning [6][15].
- A high-quality, large-scale dataset was created, comprising 1,464 selected video clips and 2,000 manually annotated question-answer pairs, to ensure evaluation accuracy [4][12].
Group 3: Evaluation Findings
- An in-depth evaluation of 18 mainstream MLLMs revealed that even the best-performing model, Gemini-2.5 Pro, achieved only 73.7% accuracy, indicating substantial room for improvement in video OCR tasks [7][20].
- The performance gap between closed-source and open-source models is significant, with many open-source models scoring below 60% accuracy [20].
Group 4: Key Limitations
- MLLMs struggle with tasks that require integrating information over long time spans and understanding dynamic text, which exposes weaknesses in their temporal reasoning [21].
- Models tend to rely too heavily on prior language knowledge rather than making effective use of visual information when comprehending video text [22].
Group 5: Optimization Strategies
- Providing higher-resolution visual inputs and more complete temporal frame coverage is crucial for improving MLLM performance in dynamic video scenarios [23].
- However, more visual input can make it harder for a model to focus on the target information, so stronger information extraction and processing capabilities are also needed [23].
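To make "temporal frame coverage" concrete, here is a minimal sketch of uniform frame sampling. It is an illustration only, not the sampling procedure used by MME-VideoOCR or by any evaluated model. The function names `uniform_frame_indices` and `max_gap_seconds` and the clip parameters are hypothetical.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly across a clip (hypothetical helper)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the clip contains.
    num_samples = min(num_samples, total_frames)
    # Split the clip into equal segments and take the midpoint of each one.
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def max_gap_seconds(total_frames: int, num_samples: int, fps: float) -> float:
    """Longest time between two consecutive sampled frames, in seconds."""
    idx = uniform_frame_indices(total_frames, num_samples)
    if len(idx) < 2:
        # With zero or one sample, treat the whole clip as the gap.
        return total_frames / fps
    return max(b - a for a, b in zip(idx, idx[1:])) / fps


if __name__ == "__main__":
    # Assumed example clip: 100 frames at 25 fps (4 seconds).
    for n in (4, 8, 16):
        print(n, uniform_frame_indices(100, n), max_gap_seconds(100, n, 25.0))
```

Under these assumptions, doubling the number of sampled frames roughly halves the longest unobserved interval, so text that appears only briefly is less likely to fall between samples. The trade-off noted above still applies: each extra frame adds visual input that the model must search through.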