Autoregressive Video Generation
Long videos drift because earlier frames are "too clean"! Research shows a shared noise level is the key to stable long-video generation
量子位· 2026-03-17 04:13
Core Insights
- The article discusses the challenges of autoregressive (AR) video generation, particularly error accumulation that causes drift and quality degradation over long sequences. It introduces HiAR, a new model designed to address these issues [3][4][12].

Group 1: Problem Identification
- The main challenge in AR video generation is the inconsistency between training and inference, which results in accumulated errors and significant drift in longer video sequences [3][4].
- Existing mitigation methods, such as simulating prediction errors and using first-frame sinks, have limitations that hinder their effectiveness [3][18].

Group 2: HiAR Model Introduction
- HiAR is a collaborative effort by researchers at multiple institutions to explore the causes of drift and provide an efficient solution [5].
- The model re-examines whether previous frames must be completely denoised, proposing a hierarchical denoising framework that enables causal generation without waiting for prior frames to be fully denoised [9][10].

Group 3: Technical Innovations
- HiAR maintains a shared noise level across all video blocks during denoising, significantly reducing error propagation between blocks and enabling pipeline-parallel inference [9][16].
- Forward KL regularization during training preserves dynamic diversity in generated videos, preventing the model from collapsing into static, low-motion outputs [10][11].

Group 4: Performance Evaluation
- On the VBench long-video benchmark, HiAR achieved a drift score of 0.257, significantly lower than baseline methods, while maintaining high visual quality and semantic stability [13][14].
- The model can generate high-quality continuous video for extended durations, producing up to 3 hours of video despite being trained only on 5-second clips, although some semantic-continuity issues remain due to the absence of an external memory module [15].

Group 5: Engineering Advantages
- HiAR's hierarchical denoising architecture delivers roughly 1.8× faster inference without compromising video quality, reaching a throughput of 30 frames per second with a latency of 0.30 seconds per chunk [16].
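The shared-noise-level idea described in Group 3 can be sketched in a few lines. The toy schedule below is only an illustration of the principle (the function name, the linear sigma schedule, and the block counts are assumptions, not HiAR's actual implementation): every block is denoised in lockstep at the same noise level, so no block ever conditions on a predecessor that is "cleaner" than itself.

```python
import numpy as np

def denoise_blocks(num_blocks: int, num_steps: int):
    """Toy lockstep denoiser: at every step, all blocks sit at the same
    noise level, so each block conditions on neighbors that are exactly
    as noisy as itself (conditioning gap = 0)."""
    sigmas = np.linspace(1.0, 0.0, num_steps)  # 1.0 = pure noise
    gaps = []
    for sigma in sigmas:
        block_levels = [float(sigma)] * num_blocks   # shared noise level
        # gap between a block and the predecessor it conditions on
        gaps.append(max(abs(block_levels[k] - block_levels[k - 1])
                        for k in range(1, num_blocks)))
    return gaps

# Standard block-wise AR conditions each block on a *fully denoised*
# predecessor, so the gap would equal sigma itself; here it is always 0.
gaps = denoise_blocks(num_blocks=4, num_steps=5)
print(gaps)  # → [0.0, 0.0, 0.0, 0.0, 0.0]
```

Because every block is at the same sigma at every step, a block can start denoising as soon as its predecessor enters the pipeline, which is what makes the pipeline-parallel inference mentioned above possible.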
MLSys 2026 | StreamDiffusionV2: Taking video generation from "offline generation" to "real-time interaction" with a truly usable generative live-streaming system
机器之心· 2026-03-13 10:41
Core Insights
- The article discusses advances in real-time live-streamed content creation with diffusion models, focusing on StreamDiffusionV2, a system that addresses the latency and quality challenges of interactive live generation [2][29].

Group 1: System Overview
- StreamDiffusionV2 is an open-source, interactive live video generation system that runs stably across various GPU types, achieving low latency and high-quality output [3][29].
- The system sustains 16 FPS real-time inference on devices equipped with dual RTX 4090 cards, with first-frame latency under 0.5 seconds on H100 GPUs [3][29].

Group 2: Challenges in Existing Models
- Current video diffusion models are optimized primarily for offline generation, leading to high first-frame latency and difficulty meeting strict service-level objectives (SLOs) for live streaming [11].
- Temporal drift during long-running generation, motion blur, and tearing during fast actions are prevalent in existing models, necessitating a redesign around real-time constraints [7][11].

Group 3: Performance Bottlenecks
- Existing systems are limited by memory bandwidth rather than computational power, particularly in autoregressive video generation [13][14].
- Communication overhead from sequence parallelism in multi-GPU setups further exacerbates performance issues, making real-time interactive generation difficult to achieve [13][14].

Group 4: Proposed Solutions
- The research team introduces a dual optimization approach, addressing both algorithm-level and system-level challenges to enable real-time video generation [15][16].
- Key innovations include a pipeline-based batch denoising strategy and a dynamic noise-adjustment mechanism driven by motion intensity, which together improve generation quality and consistency [17][18].

Group 5: Experimental Results
- StreamDiffusionV2 balances low latency with high throughput, achieving significant performance improvements over previous models [22][26].
- The system exhibits a tight latency distribution with low jitter, meeting sub-second real-time requirements while maintaining high-quality generation [26][27].

Group 6: Future Implications
- StreamDiffusionV2 bridges the gap between offline video diffusion and real-time live streaming, marking a significant step toward practical high-quality generative broadcasts [29][34].
- Its design philosophy of prioritizing memory access and real-time constraints is expected to shape future generative services in the industry [33][35].
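The pipeline-based batch denoising strategy mentioned in Group 4 can be illustrated with a toy queue. Everything here is a stand-in, not the StreamDiffusionV2 API: the step count, the multiply-by-0.5 "denoiser", and all names are assumptions. The idea it demonstrates is real, though: frames at different denoising depths share one batched model call, so after a short warm-up each call emits exactly one finished frame.

```python
from collections import deque

NUM_STEPS = 4  # denoising steps each frame must receive (assumed)

def batched_denoise(batch):
    """Stand-in for one batched network call over the whole pipeline;
    halving the latent plays the role of a single denoising step."""
    return [(latent * 0.5, step + 1) for latent, step in batch]

def stream(frames):
    pipeline = deque()   # frames currently in flight, head = oldest
    out = []
    frame_iter = iter(frames)
    while True:
        nxt = next(frame_iter, None)
        if nxt is not None:
            pipeline.append((nxt, 0))        # new frame enters at full noise
        if not pipeline:
            break
        pipeline = deque(batched_denoise(pipeline))
        if pipeline[0][1] == NUM_STEPS:      # head frame is fully denoised
            out.append(pipeline.popleft()[0])
    return out

result = stream([1.0, 2.0, 3.0])
print(result)  # → [0.0625, 0.125, 0.1875]  (each input halved 4 times)
```

After the pipeline fills, every iteration performs one batched call and pops one completed frame, which is how a multi-step diffusion model can still sustain a steady per-frame output rate.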
Self-Forcing++: Pushing autoregressive video generation models past the 4-minute length limit
机器之心· 2025-10-18 08:30
Core Insights
- The article discusses Self-Forcing++'s breakthrough in generating high-quality long videos, extending the generation length from 5 seconds to 4 minutes without retraining on additional long-video data [2][10].

Group 1: Challenges in Long Video Generation
- Long video generation has been limited to a few seconds by inherent architectural flaws in existing models, which struggle to maintain visual consistency and motion coherence beyond 10 seconds [6][7].
- The primary challenge lies in the models' inability to handle cumulative errors over extended sequences, leading to issues such as overexposure and freezing [17][20].

Group 2: Key Innovations of Self-Forcing++
- Self-Forcing++ employs a teacher model that, despite generating only 5-second videos, can correct distortions in longer videos produced by a student model [9][10].
- The process cycles through generation, distortion, correction, and learning, allowing the model to self-repair and stabilize over longer time scales [10].

Group 3: Technical Mechanisms
- Backward Noise Initialization injects noise into already-generated sequences, maintaining temporal continuity [13][15].
- Extended DMD expands teacher-student distribution alignment to a sliding window, enabling local supervision of long video sequences [16][18].
- A Rolling KV Cache aligns the training and inference phases, eliminating issues such as exposure drift and frame repetition [19][20].

Group 4: Experimental Results
- Self-Forcing++ outperforms baseline models on videos of 50, 75, and 100 seconds, demonstrating superior stability and quality [23][24].
- The model maintains consistent brightness and natural motion across long videos, with minimal degradation in visual quality [30].

Group 5: Scaling and Future Improvements
- Exploring the relationship between computational budget and video length shows that increasing training resources significantly enhances video quality [31].
- Despite these advances, challenges remain in long-term memory retention and training efficiency, indicating areas for further development [33].
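The Rolling KV Cache in Group 3 is, at its core, a fixed-size window over per-frame attention keys/values: the oldest frame's entries are evicted as new frames arrive, so the context seen at inference matches the sliding window used in training. A minimal sketch, with the class name and interface assumed for illustration (not the paper's code):

```python
from collections import deque

class RollingKVCache:
    """Toy rolling KV cache: keeps attention keys/values for only the
    most recent `window` frames, so the inference-time context never
    grows beyond the training-time sliding window."""

    def __init__(self, window: int):
        self.window = window
        self.kv = deque(maxlen=window)   # oldest frame's KV evicted first

    def append(self, frame_kv):
        self.kv.append(frame_kv)         # deque drops the head if full

    def context(self):
        return list(self.kv)             # KV entries visible to attention

cache = RollingKVCache(window=3)
for frame_id in range(5):                # frames 0..4 arrive in order
    cache.append(frame_id)
print(cache.context())  # → [2, 3, 4]  (only the most recent 3 frames)
```

Because the cache never exceeds the window, memory use is constant regardless of video length, which is what makes multi-minute generation tractable with a model trained on short clips.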
Kuaishou's Kling team proposes MIDAS: 64× compression, sub-500ms latency, a multimodal interactive digital-human framework delivering a new breakthrough in interactive generation
机器之心· 2025-09-13 08:54
Core Viewpoint
- The article discusses the rapid development of digital-human video generation, highlighting the MIDAS framework from Kuaishou's Kling team, which addresses key challenges in real-time operation, multimodal control, and long-term consistency for digital-human interaction [2][16].

Group 1: MIDAS Framework Overview
- MIDAS (Multimodal Interactive Digital-human Synthesis) combines autoregressive video generation with a lightweight diffusion denoising head to achieve real-time, smooth digital-human video synthesis under multimodal conditions [2][5].
- The system offers three core advantages: high compression, low latency, and efficient denoising, making it suitable for real-time interactive applications [4][14].

Group 2: Technical Innovations
- The framework uses an autoencoder with a 64× compression ratio, reducing each frame to at most 60 tokens and significantly lowering computational load [4][8].
- MIDAS supports diverse input signals, including audio, pose, and text, through a unified multimodal condition projector that encodes the different modalities into a shared latent space [5][12].
- The architecture pairs a Qwen2.5-3B autoregressive backbone with a diffusion head based on the PixArt-α/MLP structure, ensuring coherent outputs while minimizing computational delay [12][16].

Group 3: Training and Data
- A large-scale multimodal dialogue dataset of roughly 20,000 hours was built to train the model, covering single- and dual-speaker dialogue scenarios across multiple languages and styles [10][12].
- The training strategy includes controllable noise injection to mitigate exposure bias at inference time, improving model performance [12].

Group 4: Application Scenarios
- MIDAS can generate real-time two-person dialogue, synchronizing lip movements, expressions, and listening postures with the audio stream [13].
- The model achieves cross-language singing synthesis without explicit language identifiers, maintaining lip sync across Chinese, Japanese, and English songs in videos up to 4 minutes long [13][14].
- MIDAS shows potential as an interactive world model, responding to directional control signals in environments such as Minecraft while exhibiting scene consistency and memory [13][14].

Group 5: Future Directions
- The team plans to explore higher resolutions and more complex interaction logic, aiming to deploy the system in real production environments [17].
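The unified condition projector in Group 2 amounts to mapping each modality's features into one shared width and stacking them into a single condition-token sequence. The sketch below shows that shape bookkeeping only; the dimensions, random projections, and modality names are assumptions for illustration, not MIDAS's learned encoders:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared latent width (assumed for illustration)

# per-modality projection matrices: stand-ins for learned encoders
proj = {
    "audio": rng.normal(size=(16, D)),
    "pose":  rng.normal(size=(6, D)),
    "text":  rng.normal(size=(32, D)),
}

def project_conditions(inputs: dict) -> np.ndarray:
    """Map each modality's features into the shared space and stack them
    into one condition-token sequence for the autoregressive backbone."""
    tokens = [feats @ proj[name] for name, feats in inputs.items()]
    return np.concatenate(tokens, axis=0)

cond = project_conditions({
    "audio": rng.normal(size=(4, 16)),   # 4 audio-feature frames
    "pose":  rng.normal(size=(2, 6)),    # 2 pose keypoint vectors
    "text":  rng.normal(size=(3, 32)),   # 3 text-embedding tokens
})
print(cond.shape)  # → (9, 8): 4 + 2 + 3 tokens, all width D
```

Once every modality lives in the same width-D space, the backbone can attend over audio, pose, and text tokens uniformly, which is what lets one model accept heterogeneous control signals.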
Over 30 video frames generated per second, with real-time interaction! A new autoregressive video generation framework sets a new benchmark for generation efficiency
量子位· 2025-06-12 01:37
Core Viewpoint
- The article discusses advances in video generation through the Next-Frame Diffusion (NFD) framework, developed jointly by Microsoft Research and Peking University, which significantly improves both the quality and the efficiency of video generation [1][2].

Group 1: Video Generation Efficiency
- NFD achieves video generation at over 30 frames per second while maintaining high quality, running on NVIDIA A100 GPUs [1][4].
- The framework combines frame-wise parallel sampling with inter-frame autoregressive generation, substantially increasing generation efficiency [2][18].
- Compared with previous models, NFD generates video at approximately 0.48 seconds per frame on the A100 GPU [4].

Group 2: Technical Innovations
- NFD employs a distinctive modeling approach that pairs frame-wise bidirectional attention with inter-frame causal attention, improving the modeling of temporal dependencies [21][25].
- The architecture includes a tokenizer that converts visual signals into tokens and a diffusion-based transformer that cuts computational cost by 50% compared with traditional 3D full attention [25][26].
- Training is based on Flow Matching, which simplifies training continuous-time consistency models on video data [27][28].

Group 3: Performance Comparison
- NFD outperforms previous autoregressive models on multiple metrics, achieving a Fréchet Video Distance (FVD) of 212 and a Peak Signal-to-Noise Ratio (PSNR) of 16.46 while running at 6.15 frames per second [35].
- The accelerated variant, NFD+, reaches 42.46 FPS with the 130M model and 31.14 FPS with the 310M model while maintaining competitive visual quality [36][37].
- NFD+ retains a PSNR of 16.83 and an FVD of 227, comparable to larger models such as MineWorld [37].

Group 4: Future Implications
- Advances such as NFD point toward more flexible and efficient generation paradigms, enabling innovative applications in gaming and interactive media [15][35].
- The research highlights the potential for direct interaction between players and models in game environments, moving away from traditional game engines [3][15].
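The attention pattern described in Group 2 — bidirectional within a frame, causal across frames — comes down to a block-structured attention mask. A minimal sketch of that mask (the function name and sizes are assumptions; the block-causal pattern itself is what NFD describes):

```python
import numpy as np

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean attention mask: True where attention is allowed.
    Tokens attend bidirectionally within their own frame and causally
    (only to the same or earlier frames) across frames."""
    n = num_frames * tokens_per_frame
    frame_of = np.arange(n) // tokens_per_frame  # frame index per token
    # query i may attend key j iff j's frame is not later than i's frame
    return frame_of[None, :] <= frame_of[:, None]

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
# Token 0 (frame 0) sees token 1 (same frame, "future" position) but
# not token 2 (frame 1); token 4 (frame 2) sees everything before it.
```

Within-frame bidirectionality is what allows all of a frame's tokens to be sampled in parallel, while the across-frame causality preserves the autoregressive ordering needed for interactive, frame-by-frame generation.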