AI Video Understanding

Large Models Cannot Truly Understand Video: GPT-4o Scores Only 36%, Nanyang Technological University Team Proposes New Benchmark
量子位· 2025-08-01 07:19
Core Viewpoint
- The development of Video Large Language Models (Video LLMs) raises the question of whether these models truly "understand" video content or merely perform advanced "pattern matching" [2][3].

Group 1: Introduction of the Video Thinking Test (Video-TT)
- Researchers from Nanyang Technological University proposed a new benchmark, the Video Thinking Test (Video-TT), designed to separate the ability to "see" from the ability to "think" [2][3].
- The primary goal of Video-TT is to accurately measure AI's true understanding of, and reasoning about, video content [3].

Group 2: Key Findings
- Humans significantly outperform state-of-the-art (SOTA) models on video understanding, reaching 84.3% accuracy versus roughly 50% for SOTA models [4][29].
- Open-source models are less robust than GPT-4o, one of the SOTA models [5].
- GPT-4o struggles to recognize ambiguous or unconventional content, and has difficulty with multi-scene differentiation and world knowledge [5].

Group 3: Limitations of Existing Benchmarks
- Current video understanding benchmarks cannot distinguish whether a model's errors stem from not "seeing" enough key frames or from lacking genuine reasoning ability [9][10].
- The "frame sampling paradox" in long-video evaluation makes it unclear, when a model answers incorrectly, whether the failure is due to limited frame sampling or to weak reasoning [12][13].
- Short-video evaluation creates a "ceiling illusion": models appear to perform at human level, misleadingly suggesting that short-video understanding is a solved problem [15][16].

Group 4: Design Principles of Video-TT
- Video-TT emphasizes question complexity to stimulate "thinking," focusing on context, reasons, and scenarios rather than question type alone [17].
- The test covers two core dimensions of complexity, visual complexity and narrative complexity, each with four aspects [18][19].
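The "frame sampling paradox" described in Group 3 can be illustrated with a minimal sketch. The function names, the 30-minute duration, and the 32-frame budget below are hypothetical illustrations, not details from the Video-TT paper; the point is only that under uniform sampling, a short key event can fall entirely between sampled frames, so a wrong answer tells us nothing about the model's reasoning:

```python
def uniform_sample_times(duration_s, num_frames):
    """Timestamps (in seconds) of frames sampled uniformly across a video."""
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]

def event_is_seen(event_start, event_end, sample_times):
    """True if at least one sampled frame falls inside the event interval."""
    return any(event_start <= t <= event_end for t in sample_times)

# Hypothetical 30-minute video with a 32-frame sampling budget:
samples = uniform_sample_times(30 * 60, 32)  # one frame every 56.25 s

# A 10-second key event (e.g. at 1000-1010 s) is entirely missed,
# so the model never "sees" the evidence it would need to answer.
print(event_is_seen(1000, 1010, samples))  # → False

# A denser budget catches it, without the model reasoning any better:
print(event_is_seen(1000, 1010, uniform_sample_times(30 * 60, 256)))  # → True
```

This is why the benchmark's authors argue that long-video scores conflate perception (frame coverage) with reasoning.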
Group 5: Evaluation Results
- The evaluation results reveal a significant gap between current SOTA models and humans in video reasoning capability [26][29].
- GPT-4o performs well below human level, with a correctness score of only 36.6% [30].
- Open-source models show promise on multiple-choice questions but struggle with open-ended ones, indicating that existing benchmarks may overestimate model capabilities [31].

Group 6: Analysis of AI Errors
- The analysis identifies three core weaknesses in models such as GPT-4o: confusion about temporal and spatial relationships, lack of world knowledge, and failure to understand complex narratives [34][36].
- Models often misinterpret time and space, struggle with social and cultural context, and fail to connect narrative threads across scenes [38][40].