CVPR 2025 unified video-generation evaluation framework: SJTU and Stanford jointly propose letting MLLMs score like humans
QbitAI (量子位) · 2025-06-12 08:17
The Video-Bench evaluation framework simulates human cognitive processes to build an intelligent evaluation system connecting text instructions with visual content. Put simply, it lets multimodal large language models (MLLMs) "evaluate videos like a human." Experiments show that Video-Bench not only precisely identifies defects in generated videos along dimensions such as object consistency (0.735 correlation) and action rationality, but also reliably assesses traditionally difficult dimensions such as aesthetic quality, significantly outperforming existing evaluation methods.

Contributed by the Video-Bench team | QbitAI (量子位), WeChat account QbitAI

Video generation technology is transforming visual content creation at an unprecedented pace. From film production to advertising design, from virtual reality to social media, high-quality video generation models that match human expectations are becoming ever more important. So how can we evaluate whether an AI-generated video meets human aesthetics and needs?

The Video-Bench research team comes from Shanghai Jiao Tong University, Stanford University, Carnegie Mellon University, and other institutions.

Video-Bench: an MLLM-based automated video evaluation framework

When examining existing video evaluation methods, the Video-Bench team identified two problems:

1. Simple scoring rules often fail to capture complex dimensions such as video fluency and aesthetic quality. When judging "video quality," how can humans' vague, intuition-based impressions be turned into quantifiable metrics?

2. Existing methods based on large language ...
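The correlation figures quoted throughout this article (e.g. 0.735 for object consistency) measure rank agreement between the MLLM evaluator's scores and human scores, reported as Spearman correlations. As a minimal, stdlib-only sketch of how such an agreement figure is computed (the per-video scores below are invented for illustration, not data from the paper):

```python
def rank(values):
    """Return 1-based ranks, averaging ranks within tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rho = Pearson correlation computed on the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-video scores: MLLM evaluator vs. averaged human raters.
mllm_scores  = [4.5, 3.0, 2.0, 4.0, 1.5]
human_scores = [5.0, 3.5, 2.5, 3.0, 1.0]
print(round(spearman(mllm_scores, human_scores), 3))  # → 0.9
```

A rank correlation is the natural choice here: it rewards an evaluator that orders videos the way humans do, without requiring the raw score scales to match.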
Core Viewpoint
- Video generation technology is rapidly transforming visual content creation across various sectors, underscoring the importance of high-quality video generation models that align with human expectations [1].

Group 1: Video Evaluation Framework
- The Video-Bench framework simulates human cognitive processes to establish an intelligent evaluation system that connects text instructions with visual content [2].
- Video-Bench enables multimodal large language models (MLLMs) to evaluate videos much as humans do, identifying defects in object consistency (0.735 correlation) and action rationality, while also reliably assessing aesthetic quality [3][18].

Group 2: Innovations in Video Evaluation
- Video-Bench addresses two main issues in existing video evaluation methods: the inability to capture complex dimensions such as video fluency and aesthetics, and the difficulty of cross-modal comparison for video-text alignment [5].
- The framework introduces a dual-dimensional evaluation system covering video-condition alignment and video quality [7].
- Its key techniques are Chain-of-Query, which resolves cross-modal alignment issues through iterative questioning, and few-shot scoring, which quantifies subjective aesthetic judgments by comparing multiple videos [8][13].

Group 3: Comprehensive Evaluation Metrics
- Video-Bench decomposes video generation quality into two orthogonal dimensions, video-condition alignment and video quality, assessing both fidelity to the text prompt and the visual quality of the video itself [10].
- The evaluation framework includes metrics for object category consistency, action consistency, color consistency, scene consistency, imaging quality, aesthetic quality, temporal consistency, and motion quality [10][11].

Group 4: Performance Comparison
- Video-Bench significantly outperforms traditional methods, achieving an average Spearman correlation of 0.733 on video-condition alignment and 0.620 on video quality [18].
- On the critical metric of object category consistency, Video-Bench improves 56.3% over GRiT-based methods, reaching a correlation of 0.735 [19].
- A reliability test with a panel of 10 experts on 35,196 video samples yielded a consistency score (Krippendorff's α) of 0.52, comparable to human self-agreement levels [21].

Group 5: Current Model Evaluations
- Video-Bench evaluated seven mainstream video generation models, finding that commercial models generally outperform open-source ones, with Gen3 averaging 4.38 versus VideoCrafter2's 3.87 [25].
- The evaluation highlighted weaknesses in dynamic dimensions such as action rationality (average score 2.53/3) and motion blur (3.11/5) [26].
- Comparisons among foundation models indicated that GPT-4o typically leads in video quality and consistency scores, particularly imaging quality (0.807) and video-text consistency (0.750) [27].
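The Chain-of-Query idea from Group 2 can be pictured as an iterative questioning loop: the evaluator decomposes the text prompt into atomic claims, probes the MLLM about each one against the video, and aggregates the answers into an alignment score. The sketch below is a hypothetical illustration of that loop, not the paper's actual implementation; the `ask_mllm` stub, the question templates, and the fraction-based scoring rule are all assumptions:

```python
def chain_of_query(prompt_facts, ask_mllm, max_rounds=2):
    """Iteratively probe an MLLM about a video and score text alignment.

    prompt_facts: atomic claims extracted from the text prompt,
                  e.g. {"object": "a red car", "action": "a red car driving"}.
    ask_mllm:     callable(question) -> bool; stands in for a real
                  multimodal model queried together with the video frames.
    """
    satisfied, total = 0, 0
    for dim, fact in prompt_facts.items():
        # Round 1 asks a coarse existence question; later rounds refine it.
        questions = [f"Does the video show {fact}?"]
        for _ in range(max_rounds - 1):
            questions.append(f"Is {fact} depicted consistently across frames?")
        if all(ask_mllm(q) for q in questions):
            satisfied += 1
        total += 1
    return satisfied / total  # fraction of prompt facts the video satisfies

# Usage with a trivial stub that "confirms" everything except the action:
stub = lambda q: "driving" not in q
facts = {"object": "a red car", "action": "a red car driving"}
print(chain_of_query(facts, stub))  # → 0.5
```

Turning cross-modal alignment into a chain of per-fact yes/no probes sidesteps the need for the MLLM to emit a calibrated numeric score directly, which is the difficulty the article attributes to earlier evaluation methods.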