Core Insights
- The article discusses the introduction of the All-Angles Bench, a new benchmark for evaluating the multi-view understanding capabilities of multi-modal large language models (MLLMs) [2][4].

Group 1: Overview of All-Angles Bench
- All-Angles Bench aims to comprehensively assess the multi-view understanding abilities of MLLMs, featuring over 2,100 manually annotated multi-view question-answer pairs across 90 real-world scenarios [2][8].
- The benchmark includes six challenging tasks: Counting, Attribute Identification, Relative Distance, Relative Direction, Object Manipulation, and Camera Pose Estimation, which evaluate the models' understanding of 3D scenes [8][9].

Group 2: Performance Evaluation
- A total of 27 leading MLLMs were benchmarked, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o, revealing a significant gap between their performance and human-level understanding [4][14] (a minimal accuracy-scoring sketch follows this summary).
- In the Camera Pose Estimation task, human annotators achieved an accuracy of 88.9%, while top models such as Gemini-2.0-Flash lagged behind by over 50% [16].

Group 3: Findings and Analysis
- Certain open-source models, such as Ovis2-34B and Qwen2.5-VL-72B, outperformed closed-source models in direction-sensitive tasks, likely due to their stronger video understanding and visual localization capabilities [17].
- The analysis revealed inconsistencies in MLLMs' responses, particularly in tasks involving relative direction, indicating challenges in multi-view understanding [20][23].
- MLLMs struggled to integrate fragmented information across views, often miscounting objects when visibility was partial [24][31].

Group 4: Recommendations for Improvement
- The article suggests that merely optimizing prompts is insufficient for enhancing multi-view understanding; dedicated multi-view training is necessary for substantial performance improvements [32].
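To make the evaluation protocol concrete, here is a minimal Python sketch of how per-task accuracy on multi-view multiple-choice items like those in All-Angles Bench could be computed. The item schema (`views`, `question`, `choices`, `answer`, `task`) and the `evaluate` helper are illustrative assumptions for this digest, not the benchmark's actual code or data format.

```python
# Hypothetical sketch of scoring a multi-view QA benchmark; field names
# and the evaluation interface are assumptions, not the official schema.
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable, List, Dict

@dataclass
class MultiViewItem:
    views: List[str]      # image paths of the same scene from different viewpoints
    question: str         # e.g. "How many people are visible across all views?"
    choices: List[str]    # multiple-choice options
    answer: str           # ground-truth choice
    task: str             # one of the six tasks, e.g. "Counting", "Camera Pose Estimation"

def evaluate(model: Callable[[List[str], str, List[str]], str],
             items: List[MultiViewItem]) -> Dict[str, float]:
    """Return per-task accuracy; `model` maps (views, question, choices) -> predicted choice."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = model(item.views, item.question, item.choices)
        total[item.task] += 1
        if pred.strip() == item.answer.strip():
            correct[item.task] += 1
    return {task: correct[task] / total[task] for task in total}

# Example: a trivial baseline that always picks the first option.
if __name__ == "__main__":
    dummy = lambda views, q, choices: choices[0]
    items = [MultiViewItem(["view_a.jpg", "view_b.jpg"],
                           "How many people appear across both views?",
                           ["3", "4", "5"], "4", "Counting")]
    print(evaluate(dummy, items))  # {'Counting': 0.0}
```

Per-task accuracy of this kind is what allows the article's comparison between human annotators (e.g., 88.9% on Camera Pose Estimation) and the benchmarked MLLMs.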
GPT-4o loses to Qwen, and not a single model passes! A joint team from UC Berkeley, HKU, and others proposes a new multimodal benchmark to test multi-view understanding
量子位·2025-05-14 06:07