Unifying visual modalities! HKUST team releases a video generation model to accelerate real-world understanding
具身智能之心·2025-12-17 00:05

Core Insights
- The article introduces UnityVideo, a unified multimodal video generation model developed by research teams from Hong Kong University of Science and Technology, Chinese University of Hong Kong, Tsinghua University, and Kuaishou. The model improves video generation quality and achieves zero-shot generalization, producing plausible results for previously unseen objects and scenes [1][2][10].

Group 1: Model Capabilities
- UnityVideo is trained jointly across visual modalities such as depth maps, optical flow, skeletons, and segmentation masks, which helps the model better understand the physical world and produce more realistic, controllable videos [4][10].
- The model shows strong zero-shot generalization, adapting from single-person data to multi-person scenarios and from human skeleton data to animal skeleton estimation [13][15].
- The unified training paradigm significantly improves performance: the different visual modalities provide complementary supervisory signals that deepen the model's understanding of how the physical world operates [12][14].

Group 2: Technical Innovations
- UnityVideo implements dynamic task routing, seamlessly integrating three training paradigms (instance segmentation, dense pose understanding, and depth estimation), which helps the model distinguish object categories and understand human body structure [16][17].
- A key technical contribution is a dynamic noise scheduling strategy: the model randomly selects a training mode at each iteration, which prevents catastrophic forgetting and lets the training objectives coexist; a minimal sketch of this idea follows the summary [20][21].
- The architecture includes a context learner that injects modality-specific text prompts, strengthening semantic understanding and enabling the model to generalize from "two persons" to "two objects" in segmentation tasks; a toy sketch also appears below [23][52].

Group 3: Dataset and Evaluation
- The research team constructed the OpenUni dataset of 1.3 million multimodal video samples, with balanced sampling across all modalities and data sources to prevent overfitting; a sampler sketch is given at the end [31].
- UnityVideo achieved superior performance across tasks, reaching 97.44% background consistency and 64.12% aesthetic quality in text-to-video generation, outperforming the compared models [35].
- Qualitative results show an improved grasp of physical phenomena, such as light refraction in water, and the ability to maintain overall video quality while following depth guidance [38][39].

Group 4: User Study and Generalization
- In user studies, UnityVideo received the highest scores for physical quality (38.50%), semantic quality, and overall preference, significantly surpassing commercial models [50][51].
- The model's generalization from seen to unseen data points to semantic-level understanding, indicating a deeper grasp of modality interactions acquired during training [56][58].
- The evolution of cross-modal attention suggests that genuine world understanding requires integrating multiple perceptual dimensions, much as human cognition does [59][60].
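The dynamic routing and noise-scheduling description above is high level, and the paper's actual code is not given in this summary. The following is a minimal sketch, assuming a diffusion-style trainer, of how drawing a random task mode per iteration interleaves the three objectives so that no single one overwrites the others. All names here (TASKS, training_step, the model call signature) are hypothetical.

```python
import random
import torch
import torch.nn.functional as F

# Hypothetical task set mirroring the three paradigms named in the article:
# instance segmentation, dense pose understanding, and depth estimation.
TASKS = ["segmentation", "dense_pose", "depth"]

def training_step(model, batch, max_t=1000):
    """One unified training iteration (a sketch, not the paper's code)."""
    task = random.choice(TASKS)                   # dynamic task routing
    x0 = batch[task]                              # target for this modality
    t = torch.randint(0, max_t, (x0.shape[0],))   # per-sample timestep

    # Simple linear noise schedule; the paper's "dynamic noise scheduling"
    # is only named in the article, so this schedule stands in for it.
    alpha = 1.0 - t.float().div(max_t).view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = alpha.sqrt() * x0 + (1.0 - alpha).sqrt() * noise

    # The task label conditions the model (e.g. via a learned embedding or
    # an injected text prompt, cf. the context learner described above).
    pred = model(xt, t, task=task)
    return F.mse_loss(pred, noise)
```

Because every iteration samples a fresh task, gradients from all three objectives stay interleaved throughout training rather than arriving in long single-task phases, which is the usual reason such routing mitigates catastrophic forgetting.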

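How the context learner injects modality-specific prompts is only sketched in the article, so the module below is a toy illustration of the idea: prepend a per-modality text prompt to the user prompt and encode the result as extra conditioning tokens. MODALITY_PROMPTS, ContextLearner, and the text_encoder interface are all assumptions, not the paper's API.

```python
import torch.nn as nn

# Hypothetical per-modality prompts; the real wording is not given.
MODALITY_PROMPTS = {
    "segmentation": "segment each instance in the video",
    "dense_pose": "estimate dense body pose for every person",
    "depth": "predict per-pixel depth for the scene",
}

class ContextLearner(nn.Module):
    """Toy context learner: modality prompt -> conditioning tokens."""

    def __init__(self, text_encoder, dim):
        super().__init__()
        self.text_encoder = text_encoder  # any frozen text encoder (assumed)
        self.proj = nn.Linear(dim, dim)   # small learned adapter

    def forward(self, modality, user_prompt):
        # Carrying the task in language is what lets segmentation behavior
        # transfer from "two persons" to "two objects": the categories are
        # distinguished semantically rather than by separate task heads.
        full = MODALITY_PROMPTS[modality] + ". " + user_prompt
        tokens = self.text_encoder(full)  # assumed to return (seq, dim)
        return self.proj(tokens)
```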
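Finally, the article states that OpenUni sampling is balanced across all modalities and data sources to prevent overfitting. One standard way to realize that, sketched below under the assumption that each sample records its modality and source (the real OpenUni schema is not described here), is to sample uniformly over (modality, source) groups rather than over raw samples.

```python
import random
from collections import defaultdict

def balanced_batches(samples, batch_size):
    """Yield batches with roughly equal draws per (modality, source) group."""
    groups = defaultdict(list)
    for s in samples:                       # each sample is assumed to be a
        groups[(s["modality"], s["source"])].append(s)  # dict with these keys
    keys = list(groups)

    while True:
        # Uniform over groups, then uniform within a group: frequent
        # modalities no longer dominate the batch statistics.
        yield [random.choice(groups[random.choice(keys)])
               for _ in range(batch_size)]
```

A training loop would hold one generator for the whole run, e.g. `loader = balanced_batches(samples, 32)`, then call `batch = next(loader)` once per step.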