The "Sherlock Holmes test" of video reasoning: all large models fail | Paper and code open-sourced
QbitAI (量子位) · 2025-05-29 07:19

Core Viewpoint
- The introduction of the Video-Holmes benchmark by Tencent ARC Lab and City University of Hong Kong reveals that current large models fail to perform adequately on complex video reasoning tasks, highlighting significant gaps in their reasoning capabilities [1][7].

Group 1: Benchmark Overview
- Video-Holmes is a new benchmark for evaluating complex video reasoning, designed to address the shortcomings of existing benchmarks, which often rely on overly simplistic video sources and questions [1][8].
- The benchmark comprises 270 short films, each 1-5 minutes long, and poses seven types of high-reasoning-demand questions that require models to extract and connect multiple key pieces of information scattered throughout each video [9].

Group 2: Model Performance
- All tested large models performed poorly, with none passing the benchmark, indicating a widespread deficiency in their reasoning capabilities [5][6].
- Average scores across the reasoning categories (e.g., Social Reasoning, Intent and Motive Chain, Time Causal Inference) were notably low; the highest average score was 51.3, achieved by Gemini-2.5-Pro [6].

Group 3: Reasoning Process Analysis
- Analysis of the models' reasoning processes showed that while they could correctly perceive visual information, they struggled significantly to link clues and often overlooked critical visual details [18].
- Specific reasoning errors were noted in which models misinterpreted interactions or failed to accurately assess relationships based on video content [15][16].

Group 4: Accessibility and Tools
- The Video-Holmes benchmark and its associated resources, including evaluation code and model integration tools, have been open-sourced and are available on platforms such as GitHub and HuggingFace [19][20].
- Users who want to test their models against the benchmark can access a comprehensive guide and the necessary setup and evaluation commands [19][20].
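As a rough illustration of what the per-category scoring reported above (e.g., Gemini-2.5-Pro's 51.3 average) involves for a multiple-choice benchmark like this, here is a minimal sketch; the record field names (`category`, `answer`, `prediction`) and the exact-match scoring rule are assumptions for illustration, not the actual Video-Holmes data format or evaluation code:

```python
# Hypothetical sketch: per-category accuracy for a multiple-choice
# video reasoning benchmark. Field names and the exact-match rule are
# illustrative assumptions, not the real Video-Holmes format.
from collections import defaultdict

def score_by_category(records):
    """Return (accuracy-% per reasoning category, overall accuracy-%)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["category"]] += 1
    per_cat = {c: 100.0 * correct[c] / total[c] for c in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_cat, overall

if __name__ == "__main__":
    # Toy predictions over two of the seven reasoning categories.
    demo = [
        {"category": "Social Reasoning", "answer": "B", "prediction": "B"},
        {"category": "Social Reasoning", "answer": "C", "prediction": "A"},
        {"category": "Intent and Motive Chain", "answer": "A", "prediction": "A"},
    ]
    per_cat, overall = score_by_category(demo)
    print(per_cat, round(overall, 1))
```

Averaging such per-category accuracies over all seven question types yields a single headline score, which is how a model can be said to "fail" the benchmark overall despite uneven strengths across categories.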