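Long-Form AI Digital Humans Are Here! ByteDance and Zhejiang University Launch InfinityHuman, a Commercial-Grade Audio-Driven Digital Human Model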

Core Viewpoint
- The article discusses the launch of InfinityHuman, a commercial-grade long-sequence audio-driven video generation model developed by ByteDance's GenAI team in collaboration with Zhejiang University, aimed at addressing the industry's pain points in high-quality digital human video creation [2][6].

Group 1: Technology Breakthroughs
- InfinityHuman can generate coherent, high-resolution long videos from a single image and corresponding audio, enabling professional-grade presentations for various formats, from 30-second product pitches to 3-minute speeches [4][11].
- The model effectively addresses two major challenges in long video animation: identity drift and detail distortion, ensuring consistent facial features and natural hand movements throughout the video [8][14].

Group 2: Commercial Applications
- InfinityHuman has been successfully applied in multiple commercial scenarios, particularly excelling in supporting Chinese speech, maintaining identity stability and natural hand movements in longer videos [7][13].
- Potential applications include virtual hosts for e-commerce, virtual instructors for corporate training, and digital human anchors for content creation in social media [8][15].

Group 3: Technical Framework
- The model employs a unified framework that generates long, high-resolution speaking videos using a reference image, audio, and optional text prompts, ensuring visual consistency and accurate lip synchronization [11][16].
- It utilizes a "coarse-to-fine" strategy, starting with low-resolution video generation and refining it through a pose-guided module to enhance realism and structural integrity of hand movements [11][16].

Group 4: Performance Metrics
- Experimental results indicate that InfinityHuman outperforms mainstream baseline methods in visual realism and temporal coherence, with significant improvements in overall video quality [13][14].
- The model maintains identity consistency and enhances hand movement accuracy, particularly in complex gesture scenarios, addressing common issues like finger distortion and joint anomalies [13][14].
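
To make the "coarse-to-fine" strategy described under Technical Framework more concrete, the following is a minimal sketch of the two-stage data flow only. It is not InfinityHuman's implementation: every function name, the brightest-pixel "pose", the energy-based brightening, and the nearest-neighbour upsampling are hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale image: rows of pixel intensities in [0, 1]


def coarse_stage(reference_lo: Frame, audio: List[float],
                 samples_per_frame: int) -> List[Frame]:
    """Stage 1 stand-in (assumption): one low-resolution frame per audio window."""
    frames = []
    for start in range(0, len(audio) - samples_per_frame + 1, samples_per_frame):
        window = audio[start:start + samples_per_frame]
        energy = sum(abs(s) for s in window) / samples_per_frame
        # Placeholder motion: brighten the reference in proportion to audio energy.
        frames.append([[min(1.0, p + 0.1 * energy) for p in row]
                       for row in reference_lo])
    return frames


def estimate_pose(frame: Frame) -> Tuple[int, int]:
    """Pose stand-in (assumption): (row, col) of the brightest pixel."""
    _, r, c = max((p, r, c) for r, row in enumerate(frame)
                  for c, p in enumerate(row))
    return r, c


def refine_stage(coarse: Frame, pose: Tuple[int, int],
                 factor: int) -> Tuple[Frame, Tuple[int, int]]:
    """Stage 2 stand-in (assumption): nearest-neighbour upsampling.

    The keypoint is rescaled to the new resolution so that a real refiner
    could condition on it; this toy version only passes it through.
    """
    hi = [[p for p in row for _ in range(factor)]
          for row in coarse for _ in range(factor)]
    return hi, (pose[0] * factor, pose[1] * factor)


def generate(reference_lo: Frame, audio: List[float],
             samples_per_frame: int, factor: int):
    """Run coarse generation, then pose-conditioned refinement, frame by frame."""
    results = []
    for frame in coarse_stage(reference_lo, audio, samples_per_frame):
        pose = estimate_pose(frame)
        results.append(refine_stage(frame, pose, factor))
    return results
```

Calling `generate` with a small reference image, a list of audio samples, a window size, and an upsampling factor returns one `(high_res_frame, pose)` pair per audio window. The point of the split is that the expensive high-resolution pass only has to add detail to a clip whose timing and motion were already fixed at low resolution.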