CVPR 2025 Unified Evaluation Framework for Video Generation: SJTU and Stanford Jointly Propose Letting MLLMs Score Like Humans
量子位 (QbitAI) · 2025-06-12 08:17

Core Viewpoint
- Video generation technology is rapidly transforming visual content creation across sectors including film production, advertising design, virtual reality, and social media, making high-quality video generation models increasingly important [1].

Group 1: Video Evaluation Framework
- The Video-Bench framework evaluates AI-generated videos by simulating human cognitive processes, establishing an intelligent assessment system that connects text instructions with visual content [2].
- Video-Bench enables multimodal large language models (MLLMs) to evaluate videos the way human assessors do, effectively identifying defects in object consistency (0.735 correlation) and action rationality, while also addressing long-standing difficulties in aesthetic quality evaluation [3].

Group 2: Innovations in Video-Bench
- Video-Bench addresses two main shortcomings of existing video evaluation methods: their inability to capture complex dimensions such as video fluency and aesthetic performance, and the difficulty of cross-modal comparison in video-condition alignment assessment [5].
- The framework introduces two core innovations: a dual-dimensional evaluation framework covering video-condition alignment and video quality [7], and chain-of-query plus few-shot scoring techniques [8] (a minimal sketch appears after Group 6).

Group 3: Evaluation Dimensions
- The dual-dimensional framework decomposes video generation quality into "video-condition alignment" and "video quality," assessing both how faithfully the generated content follows the text prompt and the visual quality of the video itself [10].
- Key dimensions for video-condition alignment include object category consistency, action consistency, color consistency, scene consistency, and video-text consistency; video quality evaluation covers imaging quality, aesthetic quality, temporal consistency, and motion quality [10].

Group 4: Performance Comparison
- Video-Bench significantly outperforms traditional evaluation methods, achieving an average Spearman correlation with human judgments of 0.733 on video-condition alignment and 0.620 on video quality [18] (see the correlation example after Group 6).
- On the critical metric of object category consistency, Video-Bench improves 56.3% over the GRiT-based method, reaching a correlation of 0.735 [19].

Group 5: Robustness and Reliability
- Video-Bench's results were validated against annotations from a team of 10 experts covering 35,196 video samples, achieving a Krippendorff's α of 0.52, comparable to the agreement level among the human annotators themselves [21] (see the Krippendorff's α example after Group 6).
- The framework demonstrated high stability and reliability, with a TARA@3 score of 67% and a Krippendorff's α of 0.867, confirming the effectiveness of its component designs [23].

Group 6: Current Model Assessment
- Video-Bench evaluated seven mainstream video generation models, finding that commercial models generally outperform open-source ones: Gen3 scored an average of 4.38 versus VideoCrafter2's 3.87 [25].
- The assessment highlighted weaknesses across current models in dynamic dimensions such as action rationality (average score 2.53/3) and motion blur (3.11/5) [26].
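The chain-of-query and few-shot scoring techniques from Group 2 can be pictured as a staged prompting loop: the evaluator first asks the MLLM focused sub-questions about one dimension, then requests a final score anchored by graded examples. The sketch below is an illustrative reconstruction, not the paper's actual prompts or pipeline; `query_mllm` is a hypothetical stand-in for any video-capable MLLM client, and the anchor examples and question wording are assumptions.

```python
# Hypothetical sketch of chain-of-query evaluation for one dimension
# (object category consistency), ending in few-shot-anchored scoring.

FEW_SHOT_ANCHORS = """Scoring examples (1 = worst, 5 = best):
- Score 1: the prompted object never appears in the video.
- Score 3: the object appears but its category drifts across frames.
- Score 5: the object matches the prompt in every frame."""

def query_mllm(video_path: str, prompt: str) -> str:
    """Placeholder: send the video plus a text prompt to a multimodal LLM."""
    raise NotImplementedError("Plug in your MLLM client here.")

def chain_of_query_score(video_path: str, text_prompt: str) -> int:
    # Step 1: ask which objects the text prompt requires.
    required = query_mllm(
        video_path, f"List the objects required by this prompt: {text_prompt}")
    # Step 2: ask which objects actually appear across the video's frames.
    observed = query_mllm(
        video_path, "List the objects visible in the video, frame by frame.")
    # Step 3: few-shot scoring -- anchor the final judgment with graded examples.
    answer = query_mllm(
        video_path,
        f"{FEW_SHOT_ANCHORS}\n\nRequired objects: {required}\n"
        f"Observed objects: {observed}\n"
        "Following the examples above, rate object category consistency "
        "from 1 to 5. Reply with the number only.",
    )
    return int(answer.strip())
```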
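Group 4's headline numbers are Spearman rank correlations between the automatic evaluator's scores and human ratings over the same videos. A minimal, self-contained example of computing that statistic with SciPy follows; the score lists are made up for illustration.

```python
# Spearman rank correlation between an evaluator's scores and human
# ratings over the same set of videos (illustrative data only).
from scipy.stats import spearmanr

model_scores = [4.2, 3.1, 4.8, 2.5, 3.9]   # evaluator's score per video
human_scores = [4.0, 3.0, 5.0, 2.0, 4.5]   # mean human rating per video

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
# A rho near 0.733 would match Video-Bench's reported alignment correlation.
```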
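Group 5's reliability figures are Krippendorff's α values, a chance-corrected agreement statistic for multiple annotators with possibly missing ratings. For readers who want to compute it on their own annotations, here is a minimal sketch using the third-party `krippendorff` Python package; the tiny rating matrix is illustrative, not the paper's 35,196-sample dataset.

```python
# Inter-annotator reliability via Krippendorff's alpha
# (pip install krippendorff). Rows = annotators, columns = video
# samples; np.nan marks a missing rating.
import numpy as np
import krippendorff

ratings = np.array([
    [3, 4, 2, 5, np.nan],   # annotator 1
    [3, 4, 3, 5, 4],        # annotator 2
    [2, 4, 2, 4, 4],        # annotator 3
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha = {alpha:.3f}")
```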