Autoregressive Video Generation
Self-Forcing++: Pushing Autoregressive Video Generation Models Past the 4-Minute Duration Limit
机器之心 · 2025-10-18 08:30
Core Insights
- The article discusses the breakthrough of Self-Forcing++ in generating high-quality long videos, extending the generation time from 5 seconds to 4 minutes without requiring additional long-video data for retraining [2][10].

Group 1: Challenges in Long Video Generation
- Long video generation has been limited to a few seconds due to inherent architectural flaws in existing models, which struggle to maintain visual consistency and motion coherence beyond 10 seconds [6][7].
- The primary challenge lies in the models' inability to handle cumulative errors over extended sequences, leading to issues like overexposure and freezing [17][20].

Group 2: Key Innovations of Self-Forcing++
- Self-Forcing++ employs a unique approach where a teacher model, despite only generating 5-second videos, can correct distortions in longer videos generated by a student model [9][10].
- The process involves a cycle of generation, distortion, correction, and learning, allowing the model to self-repair and stabilize over longer time scales [10].

Group 3: Technical Mechanisms
- Backward Noise Initialization allows the model to inject noise into already-generated sequences, maintaining temporal continuity [13][15].
- Extended DMD expands the teacher-student distribution alignment to a sliding window, enabling local supervision of long video sequences [16][18].
- Rolling KV Cache aligns the training and inference phases, eliminating issues like exposure drift and frame repetition (see the sketch following this summary) [19][20].

Group 4: Experimental Results
- Self-Forcing++ outperforms baseline models in generating videos of 50, 75, and 100 seconds, demonstrating superior stability and quality [23][24].
- The model maintains consistent brightness and natural motion across long videos, with minimal degradation in visual quality [30].

Group 5: Scaling and Future Improvements
- The relationship between computational power and video length is explored, showing that increasing training resources significantly enhances video quality [31].
- Despite these advances, challenges remain in long-term memory retention and training efficiency, indicating areas for further development [33].
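To make the rolling KV cache idea concrete, here is a minimal, hypothetical PyTorch-style sketch of a fixed-length sliding key/value cache for frame-by-frame decoding. It is not the Self-Forcing++ implementation; the class name `RollingKVCache`, the window size, and the tensor shapes are all illustrative assumptions.

```python
import torch

class RollingKVCache:
    """Fixed-length sliding cache of per-frame key/value states.

    A minimal sketch (not the Self-Forcing++ code): the oldest frame is
    evicted once the window is full, so memory stays bounded while the model
    keeps attending to the most recent context during long rollouts.
    """

    def __init__(self, max_frames: int):
        self.max_frames = max_frames
        self.keys: list[torch.Tensor] = []    # one (tokens, dim) tensor per frame
        self.values: list[torch.Tensor] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.max_frames:   # evict the oldest frame
            self.keys.pop(0)
            self.values.pop(0)

    def context(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Concatenate the cached frames along the token axis for attention.
        return torch.cat(self.keys, dim=0), torch.cat(self.values, dim=0)


# Usage sketch: decode frame by frame, attending only to the rolling window.
cache = RollingKVCache(max_frames=8)
for step in range(100):
    k = torch.randn(60, 128)   # stand-in for this frame's key states
    v = torch.randn(60, 128)   # stand-in for this frame's value states
    cache.append(k, v)
    keys, values = cache.context()  # bounded context regardless of video length
```

The design point is that attention cost and memory stay bounded by the window size rather than growing with video length, which is what allows inference to match the windowed supervision used during training.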
Kuaishou's Kling Team Proposes MIDAS: 64× Compression and Sub-500ms Latency, a Multimodal Interactive Digital-Human Framework Achieving a New Breakthrough in Interactive Generation
机器之心 · 2025-09-13 08:54
Core Viewpoint
- The article discusses the rapid development of digital-human video generation technology, highlighting the introduction of the MIDAS framework by Kuaishou's Kling Team, which addresses key challenges in real-time performance, multimodal control, and long-term consistency for digital-human interaction [2][16].

Group 1: MIDAS Framework Overview
- MIDAS (Multimodal Interactive Digital-human Synthesis) combines autoregressive video generation with lightweight diffusion denoising heads to achieve real-time, smooth digital-human video synthesis under multimodal conditions [2][5].
- The system demonstrates three core advantages: high compression rate, low latency, and efficient denoising, making it suitable for real-time interactive applications [4][14].

Group 2: Technical Innovations
- The framework uses a 64× compression-ratio autoencoder that reduces each frame to at most 60 tokens, significantly lowering the computational load [4][8].
- MIDAS supports a variety of input signals, including audio, pose, and text, through a unified multimodal condition projector that encodes the different modalities into a shared latent space (a minimal sketch follows this summary) [5][12].
- The model architecture pairs a Qwen2.5-3B autoregressive backbone with a diffusion head based on the PixArt-α/MLP structure, ensuring coherent outputs while minimizing computational delay [12][16].

Group 3: Training and Data
- A large-scale multimodal dialogue dataset of approximately 20,000 hours was constructed to train the model, covering single- and dual-speaker dialogue scenarios across multiple languages and styles [10][12].
- The training strategy includes controllable noise injection to mitigate exposure bias during inference, improving the model's performance [12].

Group 4: Application Scenarios
- MIDAS can generate real-time two-person dialogue, synchronizing lip movements, expressions, and listening postures with the audio streams [13].
- The model achieves cross-language singing synthesis without explicit language identifiers, maintaining lip-sync across Chinese, Japanese, and English songs for videos up to 4 minutes long [13][14].
- MIDAS demonstrates potential as an interactive world model by responding to directional control signals in environments such as Minecraft, showcasing scene consistency and memory capabilities [13][14].

Group 5: Future Directions
- The team plans to explore higher resolution and more complex interaction logic, aiming to deploy the system in real product environments [17].
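As an illustration of the unified multimodal condition projector described above, the following is a minimal sketch assuming each modality arrives as a fixed-size feature vector. The module name, feature dimensions, and the choice of per-modality linear projections are hypothetical and not taken from the MIDAS code.

```python
import torch
import torch.nn as nn

class MultimodalConditionProjector(nn.Module):
    """Sketch of a unified condition projector (illustrative, not MIDAS itself).

    Each modality gets its own projection into a shared latent width, so
    audio, pose, and text conditions can be appended as extra condition
    tokens for the frames consumed by the autoregressive backbone.
    """

    def __init__(self, audio_dim: int, pose_dim: int, text_dim: int, latent_dim: int):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.pose_proj = nn.Linear(pose_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)

    def forward(self, audio: torch.Tensor, pose: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Project each modality into the shared latent space, then stack them
        # as condition tokens for the current frame: (batch, 3, latent_dim).
        return torch.stack(
            [self.audio_proj(audio), self.pose_proj(pose), self.text_proj(text)],
            dim=1,
        )


# Usage sketch with made-up feature sizes.
proj = MultimodalConditionProjector(audio_dim=512, pose_dim=128, text_dim=768, latent_dim=1024)
cond_tokens = proj(torch.randn(2, 512), torch.randn(2, 128), torch.randn(2, 768))
print(cond_tokens.shape)  # torch.Size([2, 3, 1024])
```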
Generating Over 30 Frames of Video per Second with Real-Time Interaction: A New Autoregressive Video Generation Framework Raises the Bar for Generation Efficiency
量子位 · 2025-06-12 01:37
Core Viewpoint
- The article discusses advances in video generation technology through the introduction of the Next-Frame Diffusion (NFD) framework, developed in a collaboration between Microsoft Research and Peking University, which significantly improves both the quality and efficiency of video generation [1][2].

Group 1: Video Generation Efficiency
- NFD achieves video generation at over 30 frames per second on NVIDIA A100 GPUs while maintaining high quality [1][4].
- The framework combines parallel sampling within each frame with autoregressive generation across frames, leading to a substantial increase in generation efficiency [2][18].
- Compared to previous models, NFD can generate videos at approximately 0.48 seconds per frame on the A100 GPU [4].

Group 2: Technical Innovations
- NFD employs a modeling approach that pairs within-frame bidirectional attention with inter-frame causal attention, which improves the modeling of temporal dependencies (a minimal sketch follows this summary) [21][25].
- The architecture includes a tokenizer that converts visual signals into tokens and a diffusion-based transformer that cuts computational cost by 50% compared with traditional 3D full-attention methods [26][25].
- The training process is based on Flow Matching, which simplifies the training of continuous-time consistency models for video data [27][28].

Group 3: Performance Comparison
- NFD outperforms previous autoregressive models on multiple metrics, achieving a Fréchet Video Distance (FVD) of 212 and a Peak Signal-to-Noise Ratio (PSNR) of 16.46 while running at 6.15 frames per second [35].
- The accelerated version, NFD+, is even faster, reaching 42.46 FPS with the 130M model and 31.14 FPS with the 310M model while maintaining competitive visual quality [36][37].
- NFD+ retains a PSNR of 16.83 and an FVD of 227, comparable to larger models such as MineWorld [37].

Group 4: Future Implications
- Advances in video generation models such as NFD point toward more flexible and efficient generation paradigms, which could enable innovative applications in gaming and interactive media [15][35].
- The research highlights the potential for direct interaction between players and models in gaming environments, moving away from traditional game engines [3][15].
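The hybrid attention pattern described above (bidirectional within a frame, causal across frames) can be expressed as a block-causal attention mask. The sketch below is an illustrative construction in PyTorch, not NFD's actual code; the function name and the toy sizes are assumptions.

```python
import torch

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Build a block-causal attention mask (illustrative sketch, not NFD's code).

    Tokens attend bidirectionally to every token in their own frame and
    causally to all tokens in earlier frames; attention to future frames is
    masked out. Returns a boolean mask where True means "may attend".
    """
    total = num_frames * tokens_per_frame
    frame_idx = torch.arange(total) // tokens_per_frame       # frame id of each token
    # Query may attend to key iff the key's frame is not in the future.
    return frame_idx.unsqueeze(1) >= frame_idx.unsqueeze(0)   # (total, total)


# Usage sketch: 3 frames of 4 tokens each.
m = block_causal_mask(num_frames=3, tokens_per_frame=4)
print(m.int())
# Each 4x4 diagonal block is all ones (bidirectional within a frame),
# blocks below the diagonal are ones (attend to past frames),
# blocks above the diagonal are zeros (no attention to future frames).
```

Because sampling within a frame is unconstrained by a token-level causal order, all tokens of the next frame can be denoised in parallel, while the frame-level causal structure preserves the autoregressive rollout that makes real-time interaction possible.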