Unifying visual multimodality! HKUST team releases video generation model to accelerate real-world understanding
具身智能之心· 2025-12-17 00:05
Core Insights
- The article introduces UnityVideo, a new unified multimodal video generation model developed by research teams from Hong Kong University of Science and Technology, Chinese University of Hong Kong, Tsinghua University, and Kuaishou. The model improves video generation quality and achieves zero-shot generalization, producing reasonable results for previously unseen objects and scenes [1][2][10].

Group 1: Model Capabilities
- UnityVideo trains jointly across visual modalities such as depth maps, optical flow, skeletons, and segmentation masks, enabling the model to better understand the physical world and produce more realistic, controllable videos [4][10].
- The model exhibits strong zero-shot generalization, adapting from single-person data to multi-person scenarios and from human skeleton data to animal skeleton estimation [13][15].
- The unified training paradigm significantly improves performance, as the different visual modalities provide complementary supervisory signals that deepen the model's understanding of how the physical world operates [12][14].

Group 2: Technical Innovations
- UnityVideo implements dynamic task routing, seamlessly integrating three training paradigms (instance segmentation, dense pose understanding, and depth estimation) so that the model can distinguish object categories and understand human body structure [16][17].
- A key technical breakthrough is the dynamic noise scheduling strategy: the model randomly selects a training mode at each iteration, which prevents catastrophic forgetting and lets the training objectives coexist harmoniously (see the first sketch after this summary) [20][21].
- The architecture includes a context learner that injects modality-specific text prompts, strengthening semantic understanding and enabling generalization from "two persons" to "two objects" in segmentation tasks (see the second sketch) [23][52].

Group 3: Dataset and Evaluation
- The research team constructed the OpenUni dataset of 1.3 million multimodal video samples, with balanced sampling across all modalities and data sources to prevent overfitting [31].
- UnityVideo achieved superior performance across tasks, reaching 97.44% background consistency and 64.12% aesthetic quality in text-to-video generation, outperforming other models [35].
- Qualitative results show an improved grasp of physical phenomena, such as light refraction in water, and the ability to follow depth guidance while preserving overall video quality [38][39].

Group 4: User Study and Generalization
- In user studies, UnityVideo received the highest scores for physical quality (38.50%), semantic quality, and overall preference, significantly surpassing commercial models [50][51].
- The model's generalization from seen to unseen data operates at the semantic level, indicating a deeper comprehension of how the modalities interact during training [56][58].
- The evolution of cross-modal attention suggests that genuine world understanding requires integrating multiple dimensions of perception, much like human cognition [59][60].
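Neither article details how the random mode selection and the noising interact, so the following is a minimal sketch, assuming a flow-matching-style diffusion setup: each iteration draws a mode at random, and only the stream(s) that mode treats as generation targets are corrupted with noise, while conditioning streams stay clean. All names here (`MODES`, `route_batch`, the tensor shapes) are hypothetical.

```python
# Hypothetical sketch of per-iteration task routing with per-stream noise,
# one plausible reading of the "dynamic noise scheduling" the articles describe.
import random
import torch

MODES = ["generate_rgb", "estimate_modality", "joint_generation"]

def add_noise(x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Flow-matching-style corruption: interpolate the clean sample toward noise."""
    noise = torch.randn_like(x)
    t = t.view(-1, 1, 1, 1, 1)            # broadcast over (B, C, T, H, W)
    return (1.0 - t) * x + t * noise

def route_batch(rgb: torch.Tensor, modality: torch.Tensor):
    """Draw a training mode at random; noise only that mode's target stream(s)."""
    mode = random.choice(MODES)
    t = torch.rand(rgb.shape[0])          # one noise level per sample
    if mode == "generate_rgb":            # condition on clean modality, denoise RGB
        return mode, add_noise(rgb, t), modality, t
    if mode == "estimate_modality":       # condition on clean RGB, denoise modality
        return mode, rgb, add_noise(modality, t), t
    return mode, add_noise(rgb, t), add_noise(modality, t), t   # denoise both

# Toy usage with (batch, channels, frames, height, width) video tensors.
rgb = torch.randn(2, 3, 8, 32, 32)
depth = torch.randn(2, 1, 8, 32, 32)
mode, rgb_in, depth_in, t = route_batch(rgb, depth)
print(mode, rgb_in.shape, depth_in.shape)
```

Drawing the mode per batch rather than fixing it per training stage is one plausible reading of "randomly select training modes during iterations," and would keep gradients from any single objective from dominating for long stretches.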
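The articles describe the context learner only as injecting modality-specific text prompts. Below is a minimal sketch of one common way to realize that: a table of learnable prompt embeddings, one row per modality, prepended to the text tokens so the backbone can tell which modality a conditioning stream carries. The class name `ContextLearner` and all dimensions are assumptions.

```python
# Hypothetical "context learner": learnable per-modality prompt embeddings
# prepended to the text tokens. Design details are not given in the articles.
import torch
import torch.nn as nn

class ContextLearner(nn.Module):
    def __init__(self, modalities, prompt_len=4, dim=512):
        super().__init__()
        self.index = {name: i for i, name in enumerate(modalities)}
        # One learned prompt of `prompt_len` tokens per modality.
        self.prompts = nn.Parameter(torch.randn(len(modalities), prompt_len, dim) * 0.02)

    def forward(self, text_tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # text_tokens: (B, L, dim); prepend the modality's learned prompt.
        prompt = self.prompts[self.index[modality]].expand(text_tokens.shape[0], -1, -1)
        return torch.cat([prompt, text_tokens], dim=1)

ctx = ContextLearner(["depth", "optical_flow", "skeleton", "segmentation"])
tokens = torch.randn(2, 16, 512)
print(ctx(tokens, "depth").shape)  # torch.Size([2, 20, 512])
```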
Unifying visual multimodality and multi-task learning! Kuaishou Kling and HKUST teams release video generation model to accelerate real-world understanding
量子位· 2025-12-14 07:12
Core Insights
- The article introduces UnityVideo, a unified visual framework developed by research teams from Hong Kong University of Science and Technology, Chinese University of Hong Kong, Tsinghua University, and Kuaishou, which enhances video generation by integrating multiple visual modalities [1][3][4].

Group 1: Model Capabilities
- UnityVideo trains jointly across visual modalities such as depth maps, optical flow, skeletons, and segmentation masks, allowing the model to better understand the physical world and generate more realistic, controllable videos [3][12].
- The model demonstrates zero-shot generalization, producing reasonable results for previously unseen objects and scenes [4][16].
- Compared with single-modality training, the unified approach significantly accelerates convergence and improves performance on RGB video generation tasks [15][16].

Group 2: Technical Innovations
- UnityVideo features dynamic task routing, seamlessly integrating three training paradigms within a single architecture [19].
- A key breakthrough is the dynamic noise scheduling strategy, which randomly selects a training mode at each iteration, preventing catastrophic forgetting and allowing multiple training objectives to coexist harmoniously [21][22].
- The model incorporates a context learner and a modality-adaptive switcher to distinguish between different modality signals, improving its ability to generalize across tasks (see the first sketch after this summary) [27][30].

Group 3: Training Strategy
- UnityVideo employs a two-phase curriculum: it first trains on carefully selected single-person scene data to establish spatial correspondence, then introduces all modalities and diverse scene data (see the second sketch) [33][35].
- The OpenUni dataset, containing 1.3 million multimodal video samples, supports this unified training paradigm and ensures balanced sampling across modalities [35][36].

Group 4: Performance Results
- UnityVideo outperforms existing models across tasks, achieving high scores in physical reasoning, controllable generation, and modality estimation [39][41].
- Qualitative results demonstrate a superior grasp of physical phenomena, such as light refraction in water, and the model maintains high video quality without common artifacts like background flickering [41][42].
- In quantitative comparisons, UnityVideo achieves 97.44% background consistency and 64.12% aesthetic quality on text-to-video generation tasks [44].

Group 5: Generalization and Understanding
- The model exhibits strong generalization, accurately estimating unseen data and avoiding the overfitting common in specialized models [43][56].
- UnityVideo's design emphasizes integrating multiple dimensions of perception, akin to human understanding, which improves its ability to model physical laws and raises overall video generation quality [60][65].
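The modality-adaptive switcher is named but not specified in the article. A minimal sketch, assuming it routes each latent stream through a per-modality input projection before the shared backbone; `ModalitySwitcher` and all dimensions are hypothetical.

```python
# Hypothetical "modality-adaptive switcher": one lightweight input projection
# per modality, selected at runtime, so the shared backbone receives latents
# already tagged by source. The per-modality linear branch is an assumption.
import torch
import torch.nn as nn

class ModalitySwitcher(nn.Module):
    def __init__(self, modalities, in_dim=16, out_dim=512):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(in_dim, out_dim) for m in modalities})

    def forward(self, latents: torch.Tensor, modality: str) -> torch.Tensor:
        return self.proj[modality](latents)   # route through that modality's branch

switcher = ModalitySwitcher(["rgb", "depth", "optical_flow", "skeleton", "segmentation"])
x = torch.randn(2, 128, 16)                   # (batch, tokens, in_dim)
print(switcher(x, "depth").shape)             # torch.Size([2, 128, 512])
```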
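As a rough illustration of the balanced sampling and two-phase curriculum credited to the OpenUni setup, the sketch below draws the modality bucket uniformly (so large buckets cannot dominate the mix) and restricts phase 1 to single-person clips. Bucket contents, clip records, and the phase switch are all invented for illustration.

```python
# Invented illustration of balanced, curriculum-gated sampling.
import random

# Each bucket maps a modality to (clip_id, is_single_person) records.
BUCKETS = {
    "depth":        [("depth_0", True), ("depth_1", False)],
    "optical_flow": [("flow_0", True)],
    "skeleton":     [("skel_0", True), ("skel_1", False)],
    "segmentation": [("seg_0", False), ("seg_1", True)],
}

def sample_clip(phase: int):
    """Phase 1: single-person clips only (to fix spatial correspondence);
    phase 2: all modalities and scene types."""
    modality = random.choice(list(BUCKETS))      # uniform over buckets, not clips
    pool = BUCKETS[modality]
    if phase == 1:
        pool = [c for c in pool if c[1]]         # keep single-person clips
    return modality, random.choice(pool)[0]

for step in range(4):
    print(sample_clip(phase=1 if step < 2 else 2))
```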