Multimodal Large Language Models (MLLM)

Do Multimodal Large Models Really "Understand" the World? Unveiling the Core Knowledge Deficits of MLLMs
机器之心· 2025-07-28 02:47
Core Insights
- The article highlights that Multi-Modal Language Models (MLLMs) exhibit impressive capabilities in high-level visual understanding and reasoning tasks, yet they frequently fail at seemingly simple tasks that even infants can accomplish [1][2]
- It questions whether MLLMs lack "core knowledge," which is essential for early human learning, indicating a potential cognitive blind spot in these models [2][5]

Research Findings
- A study from UC San Diego titled "Core Knowledge Deficits in Multi-Modal Language Models" systematically analyzes the lack of core cognitive abilities in mainstream MLLMs [3][5]
- The research reveals that current MLLMs widely lack core cognitive abilities, which cannot be naturally acquired through model scaling [5][12]

CoreCognition Framework
- The authors developed a multi-modal assessment system called CoreCognition, along with a "Concept Hacking" method to test whether models genuinely understand the core knowledge behind tasks or are merely guessing [6][18]
- CoreCognition is a large-scale assessment framework focusing on core knowledge, inspired by Piaget's theories of cognitive development, and aims to bridge the gap between cognitive science and AI testing [9][11]

Assessment Design
- The CoreCognition dataset includes 1,503 image-question pairs and generates 2,530 evaluation data points across 230 mainstream multi-modal models and 11 prompt designs, covering a wide range of model scales and instruction-following settings [11]
- The assessment is designed to be discriminative, minimizing confounding factors and avoiding text shortcuts, ensuring that models must engage in multi-modal reasoning to arrive at correct answers [11][12]

Key Findings on Model Performance
- MLLMs show significant deficiencies in basic cognitive tasks, particularly in areas like boundary perception and spatial awareness, performing poorly relative to their handling of more complex tasks [12][14]
- The study indicates that increasing model size does not significantly enhance basic cognitive abilities, and in some cases, larger models perform worse on foundational tasks [16][20]

Concept Hacking Methodology
- The Concept Hacking method involves creating control and manipulated groups to test models' understanding of core concepts by reversing key features while keeping all other conditions constant [18][29]
- Results show that many models perform well on standard tasks but fail dramatically when key features are altered, indicating a reliance on superficial pattern learning rather than true understanding [20][30] (a sketch of this paired setup follows below)

Implications and Future Directions
- The findings suggest that MLLMs lack the foundational cognitive scaffolding that humans use to build higher-level reasoning, posing a fundamental challenge to the current development path focused on scaling [22][30]
- Future directions may include explicitly injecting physical and spatial common sense into pre-training, exploring cognition-guided training mechanisms, and developing more controlled assessments of cognitive abilities [30]
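To make the control/manipulated setup concrete, here is a minimal sketch of how such a paired evaluation could be scored. The `Pair` structure, the `ask_model` callable, and the three reported rates are illustrative assumptions, not the paper's released harness.

```python
# Minimal sketch of a Concept-Hacking-style paired evaluation (assumed API).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Pair:
    concept: str           # core-knowledge concept probed, e.g. "object permanence"
    control_image: str     # standard (control) image
    control_question: str
    control_answer: str
    hacked_image: str      # same scene with the key feature reversed
    hacked_question: str
    hacked_answer: str     # the correct answer flips along with the feature

def evaluate(pairs: List[Pair], ask_model: Callable[[str, str], str]) -> dict:
    """ask_model(image_path, question) -> answer string (hypothetical model call)."""
    control_ok = hacked_ok = both_ok = 0
    for p in pairs:
        c = ask_model(p.control_image, p.control_question).strip().lower() == p.control_answer.lower()
        h = ask_model(p.hacked_image, p.hacked_question).strip().lower() == p.hacked_answer.lower()
        control_ok += c
        hacked_ok += h
        both_ok += c and h  # counts as "understanding" only if both variants are answered correctly
    n = len(pairs)
    return {
        "control_acc": control_ok / n,     # can look strong even with shortcut learning
        "hacked_acc": hacked_ok / n,       # collapses if the model relies on surface cues
        "consistent_acc": both_ok / n,     # the signal Concept Hacking is after
    }
```

The point of the paired scoring is the last figure: a model that truly holds the concept should keep its accuracy when the key feature is flipped, whereas a shortcut learner will pass the control items and fail the manipulated ones.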
CVPR 2025: A Unified Evaluation Framework for Video Generation; SJTU and Stanford Jointly Propose Letting MLLMs Score Like Humans
量子位· 2025-06-12 08:17
The Video-Bench evaluation framework simulates the human cognitive process to build an intelligent evaluation system that links text instructions with visual content. Put simply, it lets multimodal large language models (MLLMs) "evaluate videos the way a human would."

Experimental results show that Video-Bench not only precisely identifies defects of generated videos along dimensions such as object consistency (0.735 correlation) and action plausibility, but also reliably scores traditionally hard dimensions such as aesthetic quality, clearly outperforming existing evaluation methods.

Contributed by the Video-Bench team to QbitAI (量子位 | 公众号 QbitAI)

Video generation technology is reshaping visual content creation at an unprecedented pace. From filmmaking to advertising design, from virtual reality to social media, video generation models that are both high quality and aligned with human expectations are becoming increasingly important. So how should we evaluate whether AI-generated videos meet human aesthetic standards and needs?

The Video-Bench research team comes from Shanghai Jiao Tong University, Stanford University, Carnegie Mellon University, and other institutions.

Video-Bench: an MLLM-based automated video evaluation framework

When examining existing video evaluation methods, the Video-Bench team identified two problems:
1. Simple scoring rules often fail to capture complex dimensions such as video fluency and aesthetic quality. When judging "video quality," how can humans' vague, intuition-driven impressions be turned into quantifiable metrics?
2. Existing large-language-model-based ...
Core Viewpoint
- Video generation technology is rapidly transforming visual content creation across various sectors, underscoring the importance of high-quality video generation models that align with human expectations [1].

Group 1: Video Evaluation Framework
- The Video-Bench framework simulates human cognitive processes to establish an intelligent evaluation system that connects text instructions with visual content [2].
- Video-Bench enables multimodal large models (MLLM) to evaluate videos the way humans do, identifying defects in object consistency (0.735 correlation) and action rationality, while also effectively assessing aesthetic quality [3][18].

Group 2: Innovations in Video Evaluation
- Video-Bench addresses two main issues in existing video evaluation methods: the inability to capture complex dimensions like video fluency and aesthetics, and the difficulty of cross-modal comparison for video-text alignment [5].
- The framework introduces a dual-dimensional evaluation system covering video-condition alignment and video quality [7].
- Key techniques include Chain-of-Query, which resolves cross-modal alignment issues through iterative questioning, and Few-shot scoring, which quantifies subjective aesthetic judgments by comparing multiple videos [8][13].

Group 3: Comprehensive Evaluation Metrics
- Video-Bench decomposes video generation quality into two orthogonal dimensions, video-condition alignment and video quality, assessing both fidelity to the text prompt and the visual quality of the video itself [10].
- The evaluation framework includes metrics for object category consistency, action consistency, color consistency, scene consistency, imaging quality, aesthetic quality, temporal consistency, and motion quality [10][11].

Group 4: Performance Comparison
- Video-Bench significantly outperforms traditional methods, achieving an average Spearman correlation of 0.733 for video-condition alignment and 0.620 for video quality [18].
- On the key metric of object category consistency, Video-Bench improves over GRiT-based methods by 56.3%, reaching a correlation of 0.735 [19].
- A reliability test with a panel of 10 experts on 35,196 video samples yielded a consistency score (Krippendorff's α) of 0.52, comparable to human self-agreement [21]. (A minimal sketch of checking such automatic scores against human ratings follows below.)

Group 5: Current Model Evaluations
- Video-Bench evaluated seven mainstream video generation models, revealing that commercial models generally outperform open-source models, with Gen3 scoring an average of 4.38 versus VideoCrafter2's 3.87 [25].
- The evaluation highlighted weaknesses in dynamic dimensions such as action rationality (average score of 2.53/3) and motion blur (3.11/5) [26].
- Comparisons among foundation models indicated that GPT-4o typically leads in video quality and consistency scores, particularly in imaging quality (0.807) and video-text consistency (0.750) [27].
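The Spearman figures above measure how well an automatic scorer's ranking of videos agrees with human rankings. Below is a minimal sketch of that validation step for per-dimension scores; the function, variable names, and example numbers are illustrative assumptions rather than Video-Bench's actual code.

```python
# Minimal sketch: per-dimension Spearman correlation between automatic MLLM
# scores and human ratings over the same set of videos.
from scipy.stats import spearmanr

def per_dimension_correlation(model_scores: dict, human_scores: dict) -> dict:
    """Both dicts map dimension name -> list of scores over the same videos."""
    results = {}
    for dim, preds in model_scores.items():
        rho, pvalue = spearmanr(preds, human_scores[dim])
        results[dim] = {"spearman_rho": rho, "p_value": pvalue}
    return results

# Hypothetical usage with made-up scores for five videos (illustration only):
model = {"object_consistency": [4, 2, 5, 3, 1], "imaging_quality": [3, 4, 2, 5, 1]}
human = {"object_consistency": [5, 2, 4, 3, 1], "imaging_quality": [3, 5, 1, 4, 2]}
print(per_dimension_correlation(model, human))
```

A rank correlation is the natural choice here because human raters are more consistent about which video is better than about absolute scores.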
A Comprehensive Evaluation of Multimodal Models' Video OCR Capabilities: Gemini Reaches Only 73.7% Accuracy
量子位· 2025-05-30 07:10
Core Viewpoint
- The article discusses the challenges and progress of Multi-Modal Large Language Models (MLLM) in Optical Character Recognition (OCR) for dynamic video content, highlighting the need for better evaluation frameworks and stronger model capabilities in this area [1][2][5].

Group 1: Model Capabilities and Challenges
- MLLMs show excellent OCR capabilities on static images but face significant challenges when applied to dynamic video scenarios [1][2].
- The MME-VideoOCR framework aims to systematically evaluate and improve MLLMs' perception, understanding, and reasoning abilities in video OCR [3].
- Current MLLM capabilities are limited by factors such as motion blur, lighting changes, and complex temporal associations in videos, which complicate text recognition [5][21].

Group 2: Data and Task Design
- MME-VideoOCR defines a detailed task system with 10 major task categories and 25 independent tasks, emphasizing high-level capabilities such as temporal understanding and complex reasoning [6][15].
- A high-quality, large-scale dataset was built, including 1,464 curated video clips and 2,000 manually annotated question-answer pairs to ensure evaluation accuracy [4][12].

Group 3: Evaluation Findings
- An in-depth evaluation of 18 mainstream MLLMs revealed that even the best-performing model, Gemini-2.5 Pro, achieved only 73.7% accuracy, indicating substantial room for improvement on video OCR tasks [7][20].
- The gap between closed-source and open-source models is significant, with many open-source models scoring below 60% accuracy [20].

Group 4: Key Limitations
- MLLMs struggle with tasks requiring long temporal integration and dynamic text understanding, exposing weaknesses in temporal reasoning [21].
- Models tend to rely too heavily on prior language knowledge rather than effectively using visual information to understand text in videos [22].

Group 5: Optimization Strategies
- Providing higher-resolution visual inputs and broader temporal frame coverage is crucial for improving MLLM performance on dynamic video content [23]. (A minimal frame-sampling sketch along these lines follows below.)
- However, more visual input can make it harder for the model to focus on the target information, calling for stronger information extraction and filtering capabilities [23].
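The kind of preprocessing that "broader temporal coverage, higher resolution" points at can be as simple as uniform frame sampling with a controlled output size. The sketch below illustrates that idea with OpenCV; it is not MME-VideoOCR's actual pipeline, and the frame count and resolution are assumed parameters.

```python
# Minimal sketch: uniformly sample frames from a clip and cap the longer side,
# so briefly shown on-screen text is less likely to be skipped entirely.
import cv2

def sample_frames(video_path: str, num_frames: int = 16, longer_side: int = 1024):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Spread indices evenly across the whole clip for temporal coverage.
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        h, w = frame.shape[:2]
        scale = longer_side / max(h, w)  # preserve aspect ratio, limit the longer side
        frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
        frames.append(frame)
    cap.release()
    return frames
```

The trade-off noted in Group 5 shows up directly here: raising `num_frames` or `longer_side` gives the model more chances to see the text, but also more irrelevant pixels to sift through.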