Workflow
Kuaishou Kling Team Introduces MIDAS: 64× Compression, Sub-500ms Latency, a Multimodal Interactive Digital Human Framework Marking a New Breakthrough in Interactive Generation

Core Viewpoint
- The article covers the rapid progress of digital human video generation technology and the introduction of the MIDAS framework by Kuaishou's Kling Team, which tackles the key challenges of real-time performance, multimodal control, and long-term consistency in digital human interaction [2][16].

Group 1: MIDAS Framework Overview
- MIDAS (Multimodal Interactive Digital-human Synthesis) combines autoregressive video generation with a lightweight diffusion denoising head to achieve real-time, smooth digital human video synthesis under multimodal conditions [2][5].
- The system offers three core advantages: a high compression ratio, low latency, and efficient denoising, making it suitable for real-time interactive applications [4][14].

Group 2: Technical Innovations
- The framework uses a 64× compression autoencoder that reduces each frame to at most 60 tokens, greatly lowering the computational load [4][8] (see the tokenizer sketch after this summary).
- MIDAS accepts diverse input signals, including audio, pose, and text, through a unified multimodal condition projector that encodes the different modalities into a shared latent space [5][12] (see the projector sketch after this summary).
- The architecture pairs a Qwen2.5-3B autoregressive backbone with a diffusion head based on a PixArt-α/MLP structure, preserving coherence in the generated output while keeping computational delay low [12][16] (see the diffusion-head sketch after this summary).

Group 3: Training and Data
- A large-scale multimodal dialogue dataset of roughly 20,000 hours was built to train the model, covering single-person and two-person dialogue scenarios across multiple languages and styles [10][12].
- The training strategy includes controllable noise injection to mitigate exposure bias at inference time, improving the model's performance [12] (see the noise-injection sketch after this summary).

Group 4: Application Scenarios
- MIDAS generates real-time two-person dialogue, synchronizing lip movements, expressions, and listening postures with the audio streams [13].
- The model performs cross-language singing synthesis without explicit language identifiers, maintaining lip sync across Chinese, Japanese, and English songs for videos up to 4 minutes long [13][14].
- MIDAS shows potential as an interactive world model by responding to directional control signals in environments such as Minecraft, demonstrating scene consistency and memory capabilities [13][14].

Group 5: Future Directions
- The team plans to explore higher resolutions and more complex interaction logic, with the goal of deploying the system in real product environments [17].
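To make the "each frame becomes at most 60 tokens" idea concrete, here is a minimal PyTorch sketch of a frame tokenizer. Everything in it is an illustrative assumption rather than the published autoencoder: the 64× figure is read here as a per-axis downsampling factor, and the channel widths, latent dimension, and 384×640 input resolution were chosen simply so that one frame collapses to a 6×10 latent grid, i.e. 60 tokens.

```python
import torch
import torch.nn as nn

class FrameTokenizer(nn.Module):
    """Toy frame encoder: six stride-2 conv stages give 64x downsampling
    per spatial axis, so a 384x640 frame becomes a 6x10 latent grid."""
    def __init__(self, in_ch=3, latent_dim=256):
        super().__init__()
        chans = [in_ch, 32, 64, 128, 128, 256, latent_dim]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.SiLU()]
        self.encoder = nn.Sequential(*layers[:-1])  # drop the final activation

    def encode_to_tokens(self, frame):
        z = self.encoder(frame)              # (B, C, H/64, W/64)
        return z.flatten(2).transpose(1, 2)  # (B, num_tokens, C)

# A 384x640 frame maps to 6x10 = 60 tokens, matching the "<= 60 tokens" figure.
tokens = FrameTokenizer().encode_to_tokens(torch.randn(1, 3, 384, 640))
print(tokens.shape)  # torch.Size([1, 60, 256])
```

Feeding the autoregressive backbone dozens of tokens per frame instead of thousands is what keeps per-frame generation fast enough for sub-500ms interaction.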
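The unified multimodal condition projector can be pictured as a set of per-modality projections into one shared embedding space. The sketch below is a guess at that shape, assuming PyTorch; the feature dimensions (audio 1024, pose 99, text 768), the learned type embeddings, and simple concatenation along the sequence axis are all assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class MultimodalConditionProjector(nn.Module):
    """Maps audio, pose, and text features into one shared latent space."""
    def __init__(self, d_model=1024):
        super().__init__()
        self.audio_proj = nn.Linear(1024, d_model)
        self.pose_proj = nn.Linear(99, d_model)
        self.text_proj = nn.Linear(768, d_model)
        # Learned type embeddings let the backbone tell modalities apart.
        self.type_embed = nn.Embedding(3, d_model)

    def forward(self, audio, pose, text):
        # Each input: (B, T_modality, feature_dim) -> (B, T_modality, d_model)
        parts = [
            self.audio_proj(audio) + self.type_embed.weight[0],
            self.pose_proj(pose) + self.type_embed.weight[1],
            self.text_proj(text) + self.type_embed.weight[2],
        ]
        # Concatenate along the sequence axis into one condition token stream.
        return torch.cat(parts, dim=1)

cond = MultimodalConditionProjector()(
    torch.randn(1, 25, 1024),  # e.g. one second of audio features
    torch.randn(1, 25, 99),    # pose keypoints per frame
    torch.randn(1, 8, 768),    # a short text prompt embedding
)
print(cond.shape)  # torch.Size([1, 58, 1024])
```

Because every modality ends up as tokens in the same space, new control signals can in principle be added by training one more projection rather than changing the backbone.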
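The pairing of an autoregressive backbone with a lightweight diffusion head can be sketched as follows: the backbone produces a hidden state for the next frame, and a small MLP head iteratively denoises that frame's latent tokens conditioned on it. This is a conceptual sketch only; the step count, the crude update rule, and all dimensions are assumptions, and the random `backbone_state` stands in for the Qwen2.5-3B hidden state described in the article.

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Tiny MLP head that iteratively denoises a frame's latent tokens,
    conditioned on the autoregressive backbone's hidden state."""
    def __init__(self, latent_dim=256, hidden=1024, cond_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    @torch.no_grad()
    def sample(self, cond, num_tokens, steps=4):
        # Start from Gaussian noise and take a few denoising steps.
        x = torch.randn(cond.shape[0], num_tokens, self.net[-1].out_features)
        for i in reversed(range(steps)):
            t = torch.full_like(x[..., :1], (i + 1) / steps)  # timestep signal
            pred = self.net(torch.cat([x, cond.expand(-1, num_tokens, -1), t], dim=-1))
            x = x + (pred - x) / (i + 1)  # crude update toward the prediction
        return x

# Placeholder for the backbone's last hidden state at the current frame position.
backbone_state = torch.randn(1, 1, 2048)
next_frame_latents = DiffusionHead().sample(backbone_state, num_tokens=60)
print(next_frame_latents.shape)  # torch.Size([1, 60, 256])
```

Keeping the denoiser small and running only a handful of steps per frame is what makes this split attractive for low-latency streaming: the heavy sequence modeling happens once per frame in the backbone, while the head does cheap refinement.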
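Finally, the "controllable noise injection" used to mitigate exposure bias can be illustrated with a training-time corruption of the conditioning context. The noise scale, its sampling scheme, and the function name below are assumptions; the point is only that perturbing past-frame latents during training teaches the model to tolerate its own imperfect predictions when it rolls out autoregressively at inference time.

```python
import torch

def corrupt_context(context_latents, max_noise=0.3):
    """Add a randomly scaled Gaussian perturbation to past-frame latents."""
    noise_level = torch.rand(context_latents.shape[0], 1, 1) * max_noise
    return context_latents + noise_level * torch.randn_like(context_latents)

# During training, feed the corrupted (rather than clean) past-frame latents
# to the autoregressive backbone before predicting the next frame.
past = torch.randn(4, 60, 256)  # (batch, tokens per frame, latent dim)
noisy_past = corrupt_context(past)
```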