AI Video Generation Through the Lens of "World Understanding": How Do Veo3 and Sora2 Measure Up? A New Benchmark Arrives

Core Insights
- The article discusses the significant advancements in Text-to-Video (T2V) models, particularly highlighting the recent success of Sora2 and questioning whether T2V models have achieved true "world model" capabilities [1]
- A new evaluation framework called VideoVerse has been proposed to assess T2V models on their understanding of event causality, physical laws, and common sense, which are essential for a "world model" [1][3]

Evaluation Framework
- VideoVerse aims to evaluate T2V models based on two main perspectives: dynamic aspects (event following, mechanics, interaction, material properties, camera control) and static aspects (natural constraints, common sense, attribution correctness, 2D layout, 3D depth) [3]
- Each prompt corresponds to several binary evaluation questions, with event following measured through sequence consistency using Longest Common Subsequence (LCS) [4][16]

Prompt Construction
- The team employs a multi-stage process to ensure the authenticity, diversity, and evaluability of prompts, sourcing data from daily life, scientific experiments, and science fiction [8][9]
- Event and causal structures are extracted using advanced language models to convert natural language descriptions into event-level structures, laying the groundwork for evaluating "event following" [10][11]

Evaluation Methodology
- The evaluation combines QA and LCS scoring, focusing on event following, dimension-specific questions, and overall scoring that reflects both logical sequence and physical details [5][18]
- The introduction of hidden semantics aims to assess whether models can generate implicit consequences that are not explicitly stated in prompts [20][22]

Experimental Findings
- The team evaluated various open-source and closed-source models, finding that open-source models perform comparably in basic dimensions but lag significantly in world model capabilities [28]
- Even the strongest closed-source model, Sora2, shows notable deficiencies in "hidden semantics following" and certain physical/material inferences [29]

Conclusion and Future Directions
- VideoVerse provides a comprehensive evaluation framework aimed at shifting the focus from merely generating realistic visuals to understanding and simulating the world [40]
- The team has open-sourced data, evaluation code, and a leaderboard, encouraging further research to enhance world model capabilities [41]