多模态大模型(MLLM)

Search documents
 多模态大模型,真的「懂」世界吗?——揭秘 MLLM 的核心知识缺陷
 机器之心· 2025-07-28 02:47
 Core Insights - The article highlights that Multi-Modal Language Models (MLLMs) exhibit impressive capabilities in high-level visual understanding and reasoning tasks, yet they frequently fail in seemingly simple tasks that even infants can accomplish [1][2] - It questions whether MLLMs lack "core knowledge," which is essential for early human learning, indicating a potential cognitive blind spot in these models [2][5]   Research Findings - A study from UC San Diego titled "Core Knowledge Deficits in Multi-Modal Language Models" systematically analyzes the lack of core cognitive abilities in mainstream MLLMs [3][5] - The research reveals that current MLLMs widely lack core cognitive abilities, which cannot be naturally acquired through model scaling [5][12]   CoreCognition Framework - The authors developed an innovative multi-modal assessment system called CoreCognition, along with a unique "Concept Hacking" method to test whether models genuinely understand the core knowledge behind tasks or are merely guessing [6][18] - CoreCognition is a large-scale assessment framework focusing on core knowledge, inspired by Piaget's theories of cognitive development, and aims to bridge the gap between cognitive science and AI testing [9][11]   Assessment Design - The CoreCognition dataset includes 1,503 image-question pairs and generates 2,530 evaluation data points across 230 mainstream multi-modal models and 11 prompt designs, effectively covering various model scales and instruction comprehension [11] - The assessment is designed to be discriminative, minimizing confounding factors and avoiding text shortcuts, ensuring that models must engage in multi-modal reasoning to arrive at correct answers [11][12]   Key Findings on Model Performance - MLLMs show significant deficiencies in basic cognitive tasks, particularly in areas like boundary perception and spatial awareness, performing poorly compared to their understanding of more complex tasks [12][14] - The study indicates that increasing model size does not significantly enhance basic cognitive abilities, and in some cases, larger models perform worse on foundational tasks [16][20]   Concept Hacking Methodology - The Concept Hacking method involves creating control and manipulated groups to test models' understanding of core concepts by reversing key features while keeping other conditions constant [18][29] - Results show that many models perform well on standard tasks but fail dramatically when key features are altered, indicating a reliance on superficial learning rather than true understanding [20][30]   Implications and Future Directions - The findings suggest that MLLMs lack the foundational cognitive scaffolding that humans use to build higher-level reasoning, posing a fundamental challenge to the current model development path focused on scaling [22][30] - Future directions may include explicitly injecting physical and spatial common sense into pre-training phases, exploring cognitive-guided training mechanisms, and developing more controlled assessments of cognitive abilities [30]
 CVPR2025视频生成统一评估架构,上交x斯坦福联合提出让MLLM像人类一样打分
 量子位· 2025-06-12 08:17
 Core Viewpoint - Video generation technology is rapidly transforming visual content creation across various sectors, including film production, advertising design, virtual reality, and social media, making high-quality video generation models increasingly important [1].   Group 1: Video Evaluation Framework - The Video-Bench framework evaluates AI-generated videos by simulating human cognitive processes, establishing an intelligent assessment system that connects text instructions with visual content [2]. - Video-Bench enables multimodal large models (MLLM) to evaluate videos similarly to human assessments, effectively identifying defects in object consistency (0.735 correlation) and action rationality, while also addressing traditional challenges in aesthetic quality evaluation [3].   Group 2: Innovations in Video-Bench - Video-Bench addresses two main issues in existing video evaluation methods: the inability to capture complex dimensions like video fluency and aesthetic performance, and the challenges in cross-modal comparison during video-condition alignment assessments [5]. - The framework introduces two core innovations: a dual-dimensional evaluation framework covering video-condition alignment and video quality [7], and the implementation of chain-of-query and few-shot scoring techniques [8].   Group 3: Evaluation Dimensions - The dual-dimensional evaluation framework allows Video-Bench to assess video generation quality by breaking it down into "video-condition alignment" and "video quality," focusing on the accuracy of generated content against text prompts and the visual quality of the video itself [10]. - Key dimensions for video-condition consistency include object category consistency, action consistency, color consistency, scene consistency, and video-text consistency, while video quality evaluation emphasizes imaging quality, aesthetic quality, temporal consistency, and motion quality [10].   Group 4: Performance Comparison - Video-Bench significantly outperforms traditional evaluation methods, achieving an average Spearman correlation of 0.733 in video-condition alignment and 0.620 in video quality [18]. - In the critical metric of object category consistency, Video-Bench shows a 56.3% improvement over the GRiT-based method, reaching a correlation of 0.735 [19].   Group 5: Robustness and Reliability - Video-Bench's evaluation results were validated by a team of 10 experts who annotated 35,196 video samples, achieving a Krippendorff's α of 0.52, comparable to human self-assessment levels [21]. - The framework demonstrated high stability and reliability, with a TARA@3 score of 67% and a Krippendorff's α of 0.867, confirming the effectiveness of its component designs [23].   Group 6: Current Model Assessment - Video-Bench evaluated seven mainstream video generation models, revealing that commercial models generally outperform open-source models, with Gen3 scoring an average of 4.38 compared to VideoCrafter2's 3.87 [25]. - The assessment highlighted weaknesses in dynamic dimensions such as action rationality (average score of 2.53/3) and motion blur (3.11/5) across current models [26].
 CVPR2025视频生成统一评估架构,上交x斯坦福联合提出让MLLM像人类一样打分
 量子位· 2025-06-12 08:16
 Core Viewpoint - Video generation technology is rapidly transforming visual content creation across various sectors, emphasizing the importance of high-quality video generation models that align with human expectations [1].   Group 1: Video Evaluation Framework - The Video-Bench framework simulates human cognitive processes to establish an intelligent evaluation system that connects text instructions with visual content [2]. - Video-Bench enables multimodal large models (MLLM) to evaluate videos similarly to human assessments, identifying defects in object consistency (0.735 correlation) and action rationality, while also effectively assessing aesthetic quality [3][18].   Group 2: Innovations in Video Evaluation - Video-Bench addresses two main issues in existing video evaluation methods: the inability to capture complex dimensions like video fluency and aesthetics, and the challenges in cross-modal comparisons for video-text alignment [5]. - The framework introduces a dual-dimensional evaluation system covering video-condition alignment and video quality [7]. - Key technologies include Chain-of-Query, which resolves cross-modal alignment issues through iterative questioning, and Few-shot scoring, which quantifies subjective aesthetic judgments by comparing multiple videos [8][13].   Group 3: Comprehensive Evaluation Metrics - Video-Bench dissects video generation quality into two orthogonal dimensions: video-condition alignment and video quality, assessing both the fidelity to text prompts and the visual quality of the video itself [10]. - The evaluation framework includes metrics for object category consistency, action consistency, color consistency, scene consistency, imaging quality, aesthetic quality, temporal consistency, and motion quality [10][11].   Group 4: Performance Comparison - Video-Bench significantly outperforms traditional methods, achieving an average Spearman correlation of 0.733 in video-condition alignment and 0.620 in video quality [18]. - In the critical metric of object category consistency, Video-Bench shows a 56.3% improvement over GRiT methods, reaching a correlation of 0.735 [19]. - A reliability test with a panel of 10 experts on 35,196 video samples yielded a consistency score (Krippendorff's α) of 0.52, comparable to human self-assessment levels [21].   Group 5: Current Model Evaluations - Video-Bench evaluated seven mainstream video generation models, revealing that commercial models generally outperform open-source models, with Gen3 scoring an average of 4.38 compared to VideoCrafter2's 3.87 [25]. - The evaluation highlighted weaknesses in dynamic dimensions such as action rationality (average score of 2.53/3) and motion blur (3.11/5) [26]. - Comparisons among foundational models indicated that GPT-4o typically excels in video quality and consistency scores, particularly in imaging quality (0.807) and video-text consistency (0.750) [27].
 全面评估多模态模型视频OCR能力,Gemini 准确率仅73.7%
 量子位· 2025-05-30 07:10
 Core Viewpoint - The article discusses the challenges and advancements of Multi-Modal Large Language Models (MLLM) in Optical Character Recognition (OCR) for dynamic video content, highlighting the need for improved evaluation frameworks and model capabilities in this area [1][2][5].   Group 1: Model Capabilities and Challenges - MLLM has shown excellent OCR capabilities on static images but faces significant challenges when applied to dynamic video scenarios [1][2]. - The MME-VideoOCR framework aims to systematically evaluate and enhance MLLM's perception, understanding, and reasoning abilities in video OCR [3]. - Current MLLM capabilities are limited by factors such as motion blur, lighting changes, and complex temporal associations in videos, which complicate text recognition [5][21].   Group 2: Data and Task Design - MME-VideoOCR has constructed a detailed task system with 10 major task categories and 25 independent tasks, focusing on high-level capabilities like temporal understanding and complex reasoning [6][15]. - A high-quality, large-scale dataset was created, including 1,464 selected video clips and 2,000 manually annotated question-answer pairs to ensure evaluation accuracy [4][12].   Group 3: Evaluation Findings - An in-depth evaluation of 18 mainstream MLLMs revealed that even the best-performing model, Gemini-2.5 Pro, achieved only a 73.7% accuracy, indicating substantial room for improvement in video OCR tasks [7][20]. - The performance gap between closed-source and open-source models is significant, with many open-source models scoring below 60% in accuracy [20].   Group 4: Key Limitations - MLLMs struggle with tasks requiring long temporal integration and dynamic text understanding, showcasing weaknesses in temporal reasoning capabilities [21]. - There is a tendency for models to overly rely on prior language knowledge rather than effectively utilizing visual information for video text comprehension [22].   Group 5: Optimization Strategies - Providing higher resolution visual inputs and more comprehensive temporal frame coverage is crucial for enhancing MLLM performance in dynamic video scenarios [23]. - However, an increase in visual input may lead to difficulties in focusing on target information, necessitating improved information extraction and processing capabilities [23].




