The World's Most Feature-Complete Video Generation Model Is Here

Core Viewpoint
- Alibaba has launched the new Tongyi Wanxiang 2.6 model, the most feature-complete video generation model available globally, covering text-to-video, image generation, and audio-driven video creation [1].

Group 1: Video Generation Capabilities
- Wanxiang 2.6 introduces multi-audio-driven video generation, along with audio-visual synchronization and multi-shot storytelling, features that were not available in Sora 2 [2].
- The model shows significant improvements in artistic style control, realistic portrait generation, and understanding of historical and cultural semantics in image generation [3][8].
- Its video capabilities include reference-video generation, consistent subjects across frames, and natural audio-visual synchronization, all of which improve the overall user experience [11][12].

Group 2: Performance Testing
- Initial tests show that Wanxiang 2.6 performs well on subject consistency and prompt understanding, reproducing the subject's appearance at close to 1:1 and matching lip movements accurately [11].
- Multi-shot narrative generation is effective, with smooth transitions and coherent storytelling across shots, although some abstract actions remain challenging [17][18].
- The aesthetic quality of generated video has improved, with a cinematic feel and strong visual appeal, particularly in complex scenes such as cyberpunk cityscapes [14][24].

Group 3: Image Generation Enhancements
- Wanxiang 2.6 has advanced in image generation, particularly in style transfer, portrait generation, and bilingual text rendering, and shows a better grasp of new aesthetic styles [19][22].
- The model generated a food promotional poster with clear bilingual text and an appealing layout, which suggests its aesthetic judgment is reliable [25][27].
- Overall, the model performs well. It shows minor flaws in multi-character dialogue and in understanding complex actions, but it is considered usable for everyday short-video creation and derivative (remix) work [28][29].