「视频世界模型」新突破：AI连续生成5分钟，画面也不崩

Core Insights - The article discusses the emergence of AI-generated videos and the challenges of creating videos that not only look realistic but also adhere to the laws of the physical world, which is the focus of the "Video World Model" [2] - The LongVie 2 framework is introduced as a solution to generate high-fidelity, controllable videos lasting up to 5 minutes, addressing the limitations of existing models [2][6] Group 1: Challenges in Current Video Models - Current video world models face a common issue where increasing generation length leads to a decline in controllability, visual fidelity, and temporal consistency [6] - The degradation of quality in long video generation is nearly unavoidable, with issues such as visual degradation and logical inconsistencies becoming significant bottlenecks [2][12] Group 2: LongVie 2 Framework - LongVie 2 employs a three-stage progressive training strategy to enhance controllability, stability, and temporal consistency [9][14] - Stage 1 focuses on Dense & Sparse multimodal control, utilizing dense signals (like depth maps) and sparse signals (like keypoint trajectories) to provide stable and interpretable world constraints [9] - Stage 2 introduces degradation-aware training, where the model learns to maintain stability in generation despite imperfect inputs, significantly improving long-term visual fidelity [13] - Stage 3 incorporates historical context modeling, explicitly integrating information from previous segments to ensure smoother transitions and reduce semantic breaks [14] Group 3: Performance Metrics - LongVie 2 demonstrates superior controllability compared to existing methods, achieving state-of-the-art (SOTA) levels in various metrics [21][29] - Ablation studies validate the effectiveness of the three-stage training approach, showing improvements in quality, controllability, and temporal consistency across multiple indicators [26] Group 4: LongVGenBench - The article introduces LongVGenBench, the first standardized benchmark dataset designed for controllable long video generation, containing 100 high-resolution videos over 1 minute in length [28] - This benchmark aims to facilitate systematic research and fair evaluation in the field of long video generation [28]