Core Insights
- The article introduces UnityVideo, a new visual framework developed by research teams from the Hong Kong University of Science and Technology, the Chinese University of Hong Kong, Tsinghua University, and Kuaishou, which enhances video generation by integrating multiple visual modalities [1][3][4]

Group 1: Model Capabilities
- UnityVideo uses unified training across visual modalities such as depth maps, optical flow, skeletons, and segmentation masks, allowing the model to better understand the physical world and generate more realistic and controllable videos [3][12]
- The model demonstrates zero-shot generalization, producing reasonable results for previously unseen objects and scenes [4][16]
- The unified training approach significantly accelerates convergence and improves performance on RGB video generation tasks compared to single-modality training [15][16]

Group 2: Technical Innovations
- UnityVideo features dynamic task routing, allowing seamless integration of three training paradigms within a single architecture [19]
- A key breakthrough is the dynamic noise scheduling strategy, which randomly selects a training mode at each iteration, preventing catastrophic forgetting and enabling multiple training objectives to coexist harmoniously [21][22]
- The model incorporates a context learner and a modality-adaptive switcher to distinguish between different modality signals, enhancing its ability to generalize across tasks [27][30]

Group 3: Training Strategy
- UnityVideo employs a two-phase curriculum learning strategy: it first trains on carefully selected single-person scene data to establish spatial correspondence, then introduces all modalities and diverse scene data [33][35]
- The OpenUni dataset, containing 1.3 million multimodal video samples, supports this unified training paradigm and ensures balanced sampling across modalities [35][36]
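To make the dynamic noise scheduling idea concrete, here is a minimal sketch of how a training loop might randomly select one of three paradigms per iteration and decide which modality streams receive noise. This is an illustrative reconstruction, not the authors' code: the function and variable names (`sample_training_mode`, `MODALITIES`, `PARADIGMS`) are hypothetical, and the three paradigms are assumed to correspond to joint generation, modality estimation, and conditioned (controllable) generation.

```python
import random

# Hypothetical modality streams and training paradigms; names are
# illustrative, not taken from the UnityVideo paper.
MODALITIES = ["rgb", "depth", "flow", "skeleton", "segmentation"]
PARADIGMS = ["generation", "estimation", "conditioned"]

def sample_training_mode(rng: random.Random):
    """Randomly pick a training paradigm and which streams get noised.

    - generation: noise every modality (model denoises all jointly)
    - estimation: keep RGB clean, noise one auxiliary modality
      (model predicts e.g. depth from clean video)
    - conditioned: noise RGB and the other auxiliaries, keep one
      auxiliary modality clean as a control signal
    """
    paradigm = rng.choice(PARADIGMS)
    if paradigm == "generation":
        noised = set(MODALITIES)
    elif paradigm == "estimation":
        target = rng.choice(MODALITIES[1:])  # pick an auxiliary modality
        noised = {target}
    else:  # conditioned generation
        condition = rng.choice(MODALITIES[1:])
        noised = set(MODALITIES) - {condition}
    return paradigm, noised

# In a training loop, the sampled mode would determine which latent
# streams receive diffusion noise before the denoising step.
rng = random.Random(0)
paradigm, noised = sample_training_mode(rng)
```

Because the paradigm is re-sampled every iteration rather than trained in separate stages, no single objective dominates long enough for the others to be forgotten, which is the stated mechanism behind avoiding catastrophic forgetting.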
Group 4: Performance Results
- UnityVideo outperforms existing models across tasks, achieving high scores in physical reasoning, controllable generation, and modality estimation [39][41]
- Qualitative results demonstrate superior understanding of physical phenomena, such as light refraction in water, while maintaining high video quality without common artifacts like background flickering [41][42]
- In quantitative comparisons, UnityVideo achieves a background consistency score of 97.44% and an aesthetic quality score of 64.12% on text-to-video generation tasks [44]

Group 5: Generalization and Understanding
- The model exhibits strong generalization, accurately estimating unseen data and overcoming the overfitting issues common in specialized models [43][56]
- UnityVideo's design emphasizes integrating multiple dimensions of perception, akin to human understanding, which enhances its ability to model physical laws and improves overall video generation quality [60][65]
Unifying visual modalities and tasks: the Kuaishou Kling and HKUST teams release a video generation model that accelerates real-world understanding