Alibaba Open-Sources the Wan2.2-S2V Model: Synthesizing Film-Grade Digital-Human Video from a Static Image and Audio

Core Insights
- Alibaba has released Wan2.2-S2V, its latest multimodal video generation model, which has drawn significant industry attention for its advanced capabilities [1]
- From a single static image and an audio clip, the model generates high-quality digital-human videos with natural facial expressions and synchronized lip movements [1]
- Wan2.2-S2V supports a variety of image types and can produce videos up to several minutes long, an industry-leading capability [1]

User Experience
- The model can be tried on platforms such as Hugging Face and the ModelScope community, and is available for direct download and trial on the official website [1]
- Users can upload images of different subjects, including humans, cartoon characters, and animals, and the model animates them to speak, sing, or perform in time with the provided audio [1]

Technical Innovations
- Wan2.2-S2V combines several innovations, including text-guided global motion control and audio-driven fine-grained local motion, enabling efficient video generation in complex scenarios [3]
- The model uses AdaIN and cross-attention mechanisms for more accurate and dynamic audio control, and hierarchical frame compression to sustain quality over long videos (illustrative sketches of both ideas follow at the end of this summary) [3]
- Alibaba's team trained the model on a dataset of over 600,000 audio-video segments, using mixed parallel training to maximize performance [3]

Performance Metrics
- Wan2.2-S2V achieves the best results among comparable models on key metrics such as video quality, expression realism, and identity consistency [4]
- Since February of this year, the company has open-sourced several video generation models, with downloads exceeding 20 million, making them among the most popular models in the open-source community [4]
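To make the AdaIN and cross-attention bullet concrete, here is a minimal PyTorch sketch of how audio features can condition video tokens through those two mechanisms: a pooled clip-level embedding sets a global scale and shift (AdaIN), while frame-level audio tokens drive local detail via cross-attention. The module names, tensor shapes, and the residual combination are illustrative assumptions; the report does not describe Wan2.2-S2V's actual layers.

```python
# Sketch of two audio-injection mechanisms: AdaIN and cross-attention.
# All shapes and module layouts are assumptions, not the model's real design.
import torch
import torch.nn as nn

class AudioAdaIN(nn.Module):
    """Adaptive instance norm: audio features predict per-channel scale/shift."""
    def __init__(self, dim: int, audio_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(dim, affine=False)
        self.to_scale_shift = nn.Linear(audio_dim, dim * 2)

    def forward(self, x: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); audio: (batch, audio_dim) pooled clip features
        scale, shift = self.to_scale_shift(audio).chunk(2, dim=-1)
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)  # normalize over tokens
        return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class AudioCrossAttention(nn.Module):
    """Video tokens attend to per-frame audio tokens for fine-grained sync."""
    def __init__(self, dim: int, audio_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)

    def forward(self, x: torch.Tensor, audio_seq: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); audio_seq: (batch, audio_frames, audio_dim)
        out, _ = self.attn(query=x, key=audio_seq, value=audio_seq)
        return x + out  # residual connection keeps the video pathway intact

# Usage: global style via AdaIN, then local lip/expression detail via attention.
x = torch.randn(2, 256, 512)         # latent video tokens (assumed shape)
audio_clip = torch.randn(2, 768)     # pooled clip-level audio embedding
audio_seq = torch.randn(2, 50, 768)  # frame-level audio embeddings
x = AudioAdaIN(512, 768)(x, audio_clip)
x = AudioCrossAttention(512, 768)(x, audio_seq)
```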
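The hierarchical frame compression idea can likewise be sketched: keep recent history frames at full resolution and pool older ones more aggressively, so the conditioning context stays bounded over minutes-long clips. The tier boundaries and average-pooling scheme below are assumed details for illustration, not the published design.

```python
# Toy hierarchical frame compression: older frames get coarser tokens.
import torch
import torch.nn.functional as F

def compress_history(frames: torch.Tensor, recent: int = 4) -> torch.Tensor:
    """frames: (T, C, H, W) latent frames, oldest first.
    The last `recent` frames keep full resolution, the middle tier is
    pooled 2x, and the oldest tier 4x. Returns (total_tokens, C)."""
    old = frames[:-2 * recent]
    mid = frames[-2 * recent:-recent]
    new = frames[-recent:]
    tiers = []
    for tier, factor in ((old, 4), (mid, 2), (new, 1)):
        if tier.numel() == 0:
            continue  # clip shorter than the tier boundaries
        t = F.avg_pool2d(tier, factor) if factor > 1 else tier
        # (N, C, h, w) -> (N * h * w, C) token sequence
        tiers.append(t.flatten(2).transpose(1, 2).reshape(-1, frames.shape[1]))
    return torch.cat(tiers)

history = torch.randn(24, 16, 32, 32)  # 24 latent frames (assumed shape)
tokens = compress_history(history)
print(tokens.shape)  # (6144, 16): a quarter of the uncompressed 24576 tokens
```

Under this scheme the token budget grows far more slowly than the video length, which is one plausible way a model could keep long generations coherent without attending to every past frame at full resolution.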