Core Viewpoint
- The article covers advances in AI-driven digital human video generation, focusing on the limitations of current methods and on StableAvatar, a new framework for high-fidelity, infinite-length audio-driven video generation [2][5].

Group 1: Current Limitations
- Existing audio-driven human video generators can only produce clips shorter than 15 seconds; attempts at longer videos show noticeable body distortions and inconsistencies, especially in the facial region [2][3].
- Workarounds such as motion frames and sliding-window inference improve video smoothness, but they do not fundamentally resolve the quality degradation that accumulates in infinite-length generation [2][3].

Group 2: Proposed Solutions
- StableAvatar, developed by research teams from Fudan, Microsoft, and XJTU, enables infinite-length, high-fidelity audio-driven human video generation, with open-source code for both inference and training [5].
- The framework introduces a novel Timestep-aware Audio Adapter that refines the audio embeddings, curbing the accumulation of latent-distribution errors during generation [11] (a minimal sketch follows this summary).

Group 3: Technical Innovations
- The audio embeddings are injected into the denoising diffusion model, and a new Audio Native Guidance method couples audio features with the latent variables to improve lip-sync and facial-expression generation [9][15] (see the guidance sketch below).
- A dynamic weighted sliding-window strategy blends the overlapping latents of adjacent windows so their features fuse coherently, improving overall video quality [17] (see the window-blending sketch below).
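The summary does not specify the Timestep-aware Audio Adapter's architecture. Below is a minimal sketch, assuming the adapter projects audio embeddings into the diffusion model's feature space and modulates them FiLM-style with the current denoising timestep; the dimensions and layer choices (`audio_dim=768`, `model_dim=1024`, the two-layer MLPs) are hypothetical, not taken from the paper.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of diffusion timesteps (as in DDPM)."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half
    ).to(t.device)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class TimestepAwareAudioAdapter(nn.Module):
    """Hypothetical adapter: projects audio embeddings into the diffusion
    model's feature space and modulates them with the current timestep
    (FiLM-style scale and shift), so the audio conditioning can adapt as
    denoising progresses."""

    def __init__(self, audio_dim: int = 768, model_dim: int = 1024, time_dim: int = 256):
        super().__init__()
        self.time_dim = time_dim
        self.audio_proj = nn.Sequential(
            nn.Linear(audio_dim, model_dim),
            nn.SiLU(),
            nn.Linear(model_dim, model_dim),
        )
        # Maps the timestep embedding to a per-channel scale and shift.
        self.time_mlp = nn.Sequential(
            nn.Linear(time_dim, model_dim),
            nn.SiLU(),
            nn.Linear(model_dim, 2 * model_dim),
        )
        self.norm = nn.LayerNorm(model_dim)

    def forward(self, audio_emb: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # audio_emb: (B, T_audio, audio_dim); t: (B,) integer diffusion steps.
        h = self.norm(self.audio_proj(audio_emb))
        scale, shift = self.time_mlp(
            sinusoidal_timestep_embedding(t, self.time_dim)
        ).chunk(2, dim=-1)
        # Broadcast the per-sample modulation over the audio time axis.
        return h * (1 + scale[:, None, :]) + shift[:, None, :]


# Usage:
# adapter = TimestepAwareAudioAdapter()
# cond = adapter(torch.randn(2, 50, 768), torch.randint(0, 1000, (2,)))  # (2, 50, 1024)
```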
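The summary only says that Audio Native Guidance integrates audio features with the latent variables during sampling. One plausible reading, sketched here purely as an assumption, follows the classifier-free guidance pattern but gives the audio-specific direction its own weight; the function name and default scales are illustrative, not the paper's.

```python
import torch


def audio_native_guidance(
    eps_full: torch.Tensor,      # noise prediction with all conditions (audio + reference)
    eps_no_audio: torch.Tensor,  # noise prediction with the audio condition dropped
    eps_uncond: torch.Tensor,    # fully unconditional noise prediction
    guidance_scale: float = 4.5,
    audio_scale: float = 3.0,
) -> torch.Tensor:
    """Combine the three predictions so the audio-specific direction
    (eps_full - eps_no_audio) is amplified on its own, pushing lip motion
    and expressions to follow the audio more closely than plain CFG would."""
    return (
        eps_uncond
        + guidance_scale * (eps_no_audio - eps_uncond)
        + audio_scale * (eps_full - eps_no_audio)
    )


# Usage inside a sampling step (all tensors share the latent shape, e.g. (B, C, T, H, W)):
# eps = audio_native_guidance(eps_full, eps_no_audio, eps_uncond)
```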
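The exact weights used by the dynamic weighted sliding window are not given in the summary. The sketch below assumes a cosine ramp that down-weights frames near window edges, so the overlapping latents of adjacent windows cross-fade rather than hard-cut; the function and parameter names are hypothetical.

```python
import math

import torch


def blend_sliding_windows(windows, window_len: int, stride: int) -> torch.Tensor:
    """Fuse per-window latents, each of shape (C, window_len, H, W), into one
    long latent sequence. Frames in the overlap of adjacent windows are
    averaged with a cosine ramp that down-weights window edges, so windows
    cross-fade instead of hard-cutting."""
    C, _, H, W = windows[0].shape
    total_len = (len(windows) - 1) * stride + window_len
    out = torch.zeros(C, total_len, H, W)
    weight_sum = torch.zeros(1, total_len, 1, 1)
    # Per-frame weight inside a window: ~0 at the edges, 1 in the middle.
    ramp = 0.5 - 0.5 * torch.cos(torch.linspace(0.0, 2.0 * math.pi, window_len))
    ramp = ramp.clamp_min(1e-3)[None, :, None, None]
    for i, w in enumerate(windows):
        start = i * stride
        out[:, start:start + window_len] += w * ramp
        weight_sum[:, start:start + window_len] += ramp
    return out / weight_sum


# Usage: three 16-frame windows with stride 8 -> a 32-frame latent sequence.
# wins = [torch.randn(4, 16, 32, 32) for _ in range(3)]
# long_latent = blend_sliding_windows(wins, window_len=16, stride=8)
```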
Can you chat with me forever? Fudan & Microsoft propose StableAvatar: the first end-to-end framework for infinite-length audio-driven human video generation!
机器之心 · 2025-08-30 04:12