Core Viewpoint
- The article highlights the launch of Alibaba's new AI video generation model, Wan2.2-S2V, which lets users create high-quality digital-human videos from just a single image and an audio clip, marking a significant advance in AI video technology [1][3].

Group 1: Model Features
- Wan2.2-S2V delivers more natural, fluid character movement, particularly when generating a range of cinematic scenarios [3].
- The model can generate videos up to minutes in length, offering stability and consistency along with cinema-level audio capabilities [5].
- It supports advanced control of actions and environments based on user instructions [5].

Group 2: User Experience
- The model has been well received, with many users sharing positive experiences and creative applications, such as generating animated characters reciting poetry [6][15].
- Users can access the model for free on the Tongyi Wanxiang website, where they can upload their own audio or choose from a built-in voice library [2][11].

Group 3: Technical Innovations
- Wan2.2-S2V was trained on a dataset of over 600,000 audio-video segments, using mixed parallel training for full-parameter training to improve model performance [19].
- The model combines text-guided global motion control with audio-driven fine-grained local motion to achieve complex scene generation [19].
- It introduces AdaIN and CrossAttention mechanisms to synchronize audio and visuals effectively [20].

Group 4: Model Capabilities
- The model can generate long videos by employing hierarchical frame compression, expanding the motion-frame history from a few frames to 73 frames [21].
- It supports multi-resolution training, enabling video generation in various formats, including vertical short videos and horizontal films [22].
- With the release of Wan2.2-S2V, Alibaba's Tongyi model family has surpassed 20 million downloads across open-source communities and third-party platforms [23].
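The article does not detail how AdaIN is wired into Wan2.2-S2V, but the general idea of Adaptive Instance Normalization is to re-normalize one feature stream so its per-channel statistics match statistics derived from another signal (here, plausibly the audio). Below is a minimal NumPy sketch of that mechanism under that assumption; the names `audio_mean` and `audio_std` are hypothetical stand-ins for audio-derived conditioning statistics, not the model's actual interface.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization: strip the content features'
    per-channel statistics, then impose the style statistics."""
    mean = content.mean(axis=-1, keepdims=True)
    std = content.std(axis=-1, keepdims=True)
    normalized = (content - mean) / (std + eps)
    return normalized * style_std + style_mean

# Toy example: (channels, positions) visual features, with hypothetical
# per-channel statistics extracted from an audio encoder.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 16))
audio_mean = np.full((4, 1), 0.5)
audio_std = np.full((4, 1), 2.0)

out = adain(visual, audio_mean, audio_std)
# After AdaIN, each channel of `out` has mean ~0.5 and std ~2.0,
# i.e. the visual features now carry the audio-derived statistics.
```

In a real model these statistics would typically be predicted by a learned network from the audio embedding, and AdaIN would operate alongside the cross-attention pathway the article mentions rather than replace it.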
Alibaba open-sources a 14B cinema-grade video model! Hands-on test: free to try, with single generations reaching minute-level length