Core Insights
- The article introduces Video-As-Prompt, a novel semantic-controlled video generation framework that lets users supply a reference video together with a semantic description to generate new content, unifying the approach to abstract semantic-controlled video generation [3][20].

Group 1: Framework Overview
- Video-As-Prompt adopts a "video reference" paradigm: the model "clones" the semantics demonstrated in the reference video and applies them to new content, avoiding the cost and complexity of training a separate model for each semantic condition [3][20].
- The framework is built on VAP-Data, a large-scale dataset of over 100,000 videos covering more than 100 high-quality semantic conditions, enabling extensive training and evaluation [15][21].

Group 2: Technical Implementation
- The architecture employs a Mixture-of-Transformers (MoT) design, pairing a frozen video diffusion Transformer (DiT) with a trainable parallel expert Transformer to improve generalization and prevent catastrophic forgetting during training [11][13].
- By treating reference videos as "video prompts," the framework establishes a unified semantic mapping, significantly improving the model's versatility and ease of use [9][10].

Group 3: Performance and Applications
- Video-As-Prompt performs strongly on overall video quality, text consistency, and semantic coherence, outperforming open-source baselines and matching closed-source models [18].
- Supported applications include driving the same image with different reference videos, as well as zero-shot generation when presented with unseen semantic references [5][18].
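The Mixture-of-Transformers idea described above can be sketched in a few lines. This is a minimal NumPy illustration under assumptions, not the paper's actual implementation: all names, shapes, and the single-head attention are hypothetical. Reference-video ("prompt") tokens are routed through a trainable expert branch's projections, target tokens through the frozen DiT's projections, and both streams attend jointly in one attention operation, which is how the expert can inject reference semantics without updating the frozen backbone.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                 # token dimension (illustrative)
N_REF, N_TGT = 6, 8    # reference-video tokens and target tokens

def proj(seed):
    """A random D x D projection standing in for learned weights."""
    return np.random.default_rng(seed).normal(scale=D ** -0.5, size=(D, D))

# Frozen DiT branch projections (pretrained, never updated)
Wq_f, Wk_f, Wv_f = proj(1), proj(2), proj(3)
# Trainable expert branch projections (the only weights that would train)
Wq_e, Wk_e, Wv_e = proj(4), proj(5), proj(6)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mot_joint_attention(ref_tokens, tgt_tokens):
    """Each stream uses its own branch's Q/K/V weights, but attention
    runs over the concatenated sequence, so target tokens can read
    semantics directly from the reference-video tokens."""
    q = np.concatenate([ref_tokens @ Wq_e, tgt_tokens @ Wq_f])
    k = np.concatenate([ref_tokens @ Wk_e, tgt_tokens @ Wk_f])
    v = np.concatenate([ref_tokens @ Wv_e, tgt_tokens @ Wv_f])
    attn = softmax(q @ k.T / np.sqrt(D))      # joint attention map
    out = attn @ v
    return out[: ref_tokens.shape[0]], out[ref_tokens.shape[0]:]

ref = rng.normal(size=(N_REF, D))   # reference-video tokens
tgt = rng.normal(size=(N_TGT, D))   # target (generated) tokens
ref_out, tgt_out = mot_joint_attention(ref, tgt)
print(ref_out.shape, tgt_out.shape)  # (6, 16) (8, 16)
```

In a real diffusion step this joint attention would sit inside every block, with gradients flowing only into the expert weights, which is what lets the frozen backbone keep its pretrained generation ability (avoiding catastrophic forgetting) while the expert learns to map reference semantics onto new content.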
Why should prompts for video generation be limited to text? ByteDance & CUHK release Video-As-Prompt
机器之心 · 2025-11-18 05:08