Quark and Zhejiang University Open-Source OmniAvatar: One Image Plus One Audio Clip Is Enough to Generate Long Videos

Core Insights
- OmniAvatar is an audio-driven full-body video generation model that needs only one image and one audio clip to produce a matching video, with notably better lip-sync detail and more fluid full-body motion [1][6]
- Text prompts give precise control over the character's pose, emotion, and scene, which makes the model usable across a range of applications [1][10]

Performance Metrics
- Experimental results indicate that OmniAvatar outperforms existing methods in lip-sync accuracy, in facial and upper-body video generation, and in text control, while balancing video quality, accuracy, and aesthetics [3]
- In the comparison with other models, OmniAvatar reached an FID of 67.6 and an FVD of 664 (lower is better on both metrics), the results cited as evidence of its advantage in video generation [5]

Technical Innovations
- OmniAvatar is built on the Wan2.1-T2V-14B model and is fine-tuned with LoRA, which integrates audio features while preserving the base model's strong video generation capabilities [8]
- The model uses a pixel-level audio embedding strategy that injects audio features directly into the model's latent space, yielding natural lip movements and coordinated body motion [13]

Long Video Generation
- The model is optimized for long video generation and keeps character consistency and temporal coherence through reference frame embedding and an overlapping-frame strategy [6][19]
- A reference frame serves as a fixed guide to the character's identity, and a latent overlapping strategy keeps consecutive segments continuous, so identity stays anchored across long sequences [20]

Future Directions
- OmniAvatar is an initial attempt at multi-modal video generation. It has been validated only preliminarily on experimental datasets and has not yet reached product-level application [22]
- Future work will focus on handling complex instructions and multi-character interactions so that the model applies to more scenarios [22]
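
The overlapping-frame idea described under Long Video Generation can be illustrated with a small scheduling sketch. This is not OmniAvatar's code: the function names `plan_overlapping_segments` and `stitch_segments`, and the segment length and overlap values used in the example, are hypothetical, since the summary does not state them. The sketch only shows how a long clip could be split into windows that share frames, and how the shared frames would be dropped when the segments are joined.

```python
from typing import List, Sequence, Tuple


def plan_overlapping_segments(
    total_frames: int, segment_len: int, overlap: int
) -> List[Tuple[int, int]]:
    """Return [start, end) windows covering total_frames.

    Every window after the first begins `overlap` frames before the previous
    window ended, so those frames can be re-used as conditioning context.
    (Hypothetical helper; values are illustrative, not from the source.)
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if segment_len <= 0 or not 0 <= overlap < segment_len:
        raise ValueError("need 0 <= overlap < segment_len")

    windows: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + segment_len, total_frames)
        windows.append((start, end))
        if end == total_frames:
            break
        # Step back by `overlap` frames so the next segment shares context.
        start = end - overlap
    return windows


def stitch_segments(segments: Sequence[Sequence], overlap: int) -> list:
    """Join generated segments, dropping the overlapped prefix of each later one."""
    if not segments:
        return []
    out = list(segments[0])
    for seg in segments[1:]:
        out.extend(seg[overlap:])
    return out


if __name__ == "__main__":
    # Assumed numbers for illustration only.
    windows = plan_overlapping_segments(total_frames=10, segment_len=4, overlap=1)
    print(windows)
    # In a real pipeline each window would be generated by the video model,
    # conditioned on the same reference image plus the overlapped frames.
    frames = list(range(10))
    segments = [frames[s:e] for s, e in windows]
    print(stitch_segments(segments, overlap=1))
```

In this sketch the reference image plays no role in the scheduling itself. It would be passed unchanged to every segment's generation call, which is how the summary describes identity being anchored across segments.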