Autoregressive Video Generation
Kuaishou's Kling Team Proposes MIDAS: 64x Compression, Sub-500ms Latency, and a Multimodal Interactive Digital-Human Framework That Breaks New Ground in Interactive Generation
机器之心· 2025-09-13 08:54
Digital-human video generation is rapidly becoming one of the core technologies for enriching human-computer interaction. However, existing methods still face significant challenges in achieving low latency, multimodal control, and long-horizon temporal consistency: most systems either carry heavy computational overhead and cannot respond in real time, or handle only a single input modality and lack genuine interactive capability.

To address these problems, the Kuaishou Kling Team proposed a new framework named MIDAS (Multimodal Interactive Digital-human Synthesis), which couples autoregressive video generation with a lightweight diffusion denoising head to achieve real-time, fluent digital-human video synthesis under multimodal conditions. The team highlights three core advantages for the system.

The work has been validated in extensive experiments, performing strongly on multilingual conversation, singing synthesis, and even interactive world modeling, and it offers a new approach to real-time digital-human interaction.

Authors (list truncated in the source): Ming Chen1*, Liyuan Cui1,2*, Wenyuan Zhang1,3*, Haoxian Zhang1, Yan Zhou1, Xiaohan Li1, Songlin Tang, Jiwen Liu1, Borui Liao1, Hejia Chen1, Xi ...
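The excerpt describes MIDAS only at a high level: an autoregressive backbone produces a per-frame latent from the generation history and the multimodal conditions, and a lightweight diffusion denoising head refines that latent with a small number of steps so per-frame latency stays within the reported sub-500ms budget. The sketch below is a toy illustration of that control flow under those assumptions; every name and constant (`backbone_predict`, `diffusion_head_denoise`, `LATENT_DIM`, `DENOISE_STEPS`) is hypothetical and not taken from the paper.

```python
# Minimal sketch (not the authors' code): an autoregressive frame loop in which a
# causal backbone predicts a coarse latent for the next frame and a lightweight
# diffusion head refines it over a few denoising steps. All names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 64      # hypothetical per-frame latent size
DENOISE_STEPS = 4    # a small step count keeps per-frame latency low

def backbone_predict(history, condition):
    """Stand-in for the autoregressive backbone: maps past frame latents plus a
    multimodal condition (e.g. audio/text features) to a coarse next-frame latent."""
    ctx = history[-1] if history else np.zeros(LATENT_DIM)
    return 0.9 * ctx + 0.1 * condition

def diffusion_head_denoise(coarse_latent, steps=DENOISE_STEPS):
    """Stand-in for the lightweight diffusion head: starts from noise and is pulled
    toward the coarse latent over a few steps (a toy update, not a real sampler)."""
    x = rng.normal(size=LATENT_DIM)
    for t in range(steps, 0, -1):
        x = x + (coarse_latent - x) / t
    return x

def generate(num_frames, conditions):
    history = []
    for f in range(num_frames):
        coarse = backbone_predict(history, conditions[f])
        frame_latent = diffusion_head_denoise(coarse)
        history.append(frame_latent)   # the new frame becomes context for the next step
        yield frame_latent             # decoding latents to pixels would happen here

conditions = rng.normal(size=(8, LATENT_DIM))   # e.g. per-frame audio features
frames = list(generate(8, conditions))
print(len(frames), frames[0].shape)
```

The point of this split is that the expensive sequence model runs once per frame while only the small denoising head iterates, which is what makes streaming, interactive generation plausible at such a latency budget.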
Over 30 Video Frames Generated per Second with Real-Time Interaction: A New Autoregressive Video Generation Framework Raises the Bar for Generation Efficiency
量子位· 2025-06-12 01:37
Core Viewpoint
- The article discusses advances in video generation brought by the Next-Frame Diffusion (NFD) framework, developed in a collaboration between Microsoft Research and Peking University, which significantly improves both the quality and the efficiency of video generation [1][2].

Group 1: Video Generation Efficiency
- NFD generates video at over 30 frames per second while maintaining high quality on NVIDIA A100 GPUs [1][4].
- The framework combines frame-wise parallel sampling with inter-frame autoregressive generation, yielding a substantial increase in generation efficiency [2][18].
- Compared with previous models, NFD can generate video in approximately 0.48 seconds per frame on the A100 GPU [4].

Group 2: Technical Innovations
- NFD models video with frame-wise bidirectional attention combined with inter-frame causal attention, which improves the modeling of temporal dependencies [21][25] (a hedged mask sketch follows after this summary).
- The architecture pairs a tokenizer that converts visual signals into tokens with a diffusion-based transformer, cutting computational cost by 50% relative to traditional 3D full-attention designs [26][25].
- Training is based on Flow Matching, which simplifies training continuous-time consistency models on video data [27][28] (see the generic loss sketch below).

Group 3: Performance Comparison
- NFD outperforms previous autoregressive models on multiple metrics, reaching a Fréchet Video Distance (FVD) of 212 and a Peak Signal-to-Noise Ratio (PSNR) of 16.46 while running at 6.15 frames per second [35].
- The accelerated version, NFD+, reaches 42.46 FPS for the 130M model and 31.14 FPS for the 310M model while maintaining competitive visual quality [36][37].
- NFD+ retains a PSNR of 16.83 and an FVD of 227, comparable to larger models such as MineWorld [37].

Group 4: Future Implications
- Models such as NFD point to a trend toward more flexible and efficient generation paradigms, with potential applications in gaming and interactive media [15][35].
- The research highlights the possibility of players interacting directly with a model in game environments, moving away from traditional game engines [3][15].
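The summary's central architectural point is the attention pattern: tokens within a frame attend to each other bidirectionally, while attention across frames is causal, so each new frame can be sampled in parallel over its tokens yet conditions only on past frames. Below is a minimal sketch of such a block-causal mask; the shapes, function name, and token layout are assumptions for illustration, not NFD's actual code.

```python
# Minimal sketch (not the NFD implementation): a block-causal attention mask in which
# tokens attend bidirectionally within their own frame but only causally across frames.
import numpy as np

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Return a boolean mask of shape (T, T), with T = num_frames * tokens_per_frame.
    mask[i, j] is True when query token i may attend to key token j."""
    T = num_frames * tokens_per_frame
    frame_id = np.arange(T) // tokens_per_frame   # frame index of each token
    # Allowed iff the key's frame is not later than the query's frame:
    #   same frame    -> bidirectional attention within the frame
    #   earlier frame -> causal attention across frames
    return frame_id[None, :] <= frame_id[:, None]

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```

Passing a mask like this to a standard transformer attention layer gives intra-frame bidirectional and inter-frame causal attention in a single pass, which is what allows all tokens of the next frame to be sampled in parallel.

The summary also states that training uses Flow Matching. As a reminder of what that objective looks like in its common rectified-flow form, here is a generic sketch; it is not claimed to match NFD's exact formulation or its continuous-time consistency-distillation setup.

```python
# Generic flow-matching training loss (an assumption-level sketch, not NFD's recipe).
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1):
    """x1: clean frame latents, shape (batch, dim). The model is trained to predict
    the velocity that transports noise x0 toward data x1 along a straight path."""
    x0 = rng.normal(size=x1.shape)             # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))     # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1               # point on the linear interpolation path
    target_velocity = x1 - x0                  # d x_t / d t along that path
    pred = model(xt, t)                        # the network predicts the velocity field
    return float(np.mean((pred - target_velocity) ** 2))

# Placeholder "model" that always predicts zero velocity, just to show the call shape.
x1 = rng.normal(size=(4, 8))
print(flow_matching_loss(lambda xt, t: np.zeros_like(xt), x1))
```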