Hierarchical Denoising Framework
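Video generation drifts as clips get longer, and the culprit turns out to be preceding frames that are "too clean"! Study finds a shared noise level is the key to stable long videos
量子位 (QbitAI) · 2026-03-17 04:13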
Core Insights
- The article discusses the challenges of autoregressive (AR) video generation, in particular error accumulation that causes drift and quality degradation over long sequences. It introduces HiAR, a new model designed to address these issues [3][4][12].

Group 1: Problem Identification
- The main challenge in AR video generation is the mismatch between training and inference, which lets errors accumulate and causes significant drift in longer video sequences [3][4].
- Existing methods for mitigating drift, such as simulating prediction errors and using first-frame sinks, have limitations that reduce their effectiveness [3][18].

Group 2: HiAR Model Introduction
- HiAR is a collaboration among researchers at multiple institutions that investigates the causes of drift and proposes an efficient solution [5].
- The model questions whether previous frames need to be completely denoised before they serve as context. It proposes a hierarchical denoising framework that allows causal generation without waiting for prior frames to be fully denoised [9][10].

Group 3: Technical Innovations
- HiAR keeps all video blocks at a shared noise level during denoising, which significantly reduces error propagation between blocks and enables pipeline-parallel inference [9][16].
- Forward KL regularization during training preserves the diversity of motion in generated videos, preventing the model from producing static, low-motion outputs [10][11].

Group 4: Performance Evaluation
- On the VBench long-video benchmark, HiAR achieved a drift score of 0.257, significantly lower than baseline methods, while maintaining high visual quality and semantic stability [13][14].
- The model can generate high-quality continuous video for extended durations, reaching 3 hours of generated video with only 5-second training data, although some semantic continuity issues remain because it has no external memory module [15].

Group 5: Engineering Advantages
- HiAR's hierarchical denoising architecture delivers approximately 1.8 times faster inference without compromising video quality, reaching a throughput of 30 frames per second at a low latency of 0.30 seconds per chunk [16].
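
To make the shared-noise-level idea and its link to pipeline-parallel inference more concrete, here is a minimal scheduling sketch. It is not HiAR's implementation. The function names are hypothetical, and the dependency rule is an assumption made for illustration: task (block b, step s) needs block b's own previous step and the earlier blocks at the same step. The sketch contrasts the conventional block-first order with a step-first order in which every block sits at the same noise level, and then groups independent tasks into parallel "ticks".

```python
from typing import List, Tuple

# A task is (block index, denoising step index). Purely illustrative.
Task = Tuple[int, int]


def block_first_order(num_blocks: int, num_steps: int) -> List[Task]:
    """Conventional AR order: fully denoise one block before starting the next."""
    return [(b, s) for b in range(num_blocks) for s in range(num_steps)]


def hierarchical_order(num_blocks: int, num_steps: int) -> List[Task]:
    """Step-first order: at each noise level, visit all blocks causally,
    so the context of every block is at the same (shared) noise level."""
    return [(b, s) for s in range(num_steps) for b in range(num_blocks)]


def wavefront_schedule(num_blocks: int, num_steps: int) -> List[List[Task]]:
    """Group tasks into ticks under the ASSUMED dependency rule:
    (b, s) needs (b, s - 1) and every (b', s) with b' < b.
    All tasks with b + s == tick have their dependencies in earlier ticks,
    so they could run concurrently on different pipeline stages."""
    ticks: List[List[Task]] = []
    for t in range(num_blocks + num_steps - 1):
        tick = [(b, t - b) for b in range(num_blocks) if 0 <= t - b < num_steps]
        ticks.append(tick)
    return ticks


if __name__ == "__main__":
    blocks, steps = 4, 4
    print("sequential tasks:", len(hierarchical_order(blocks, steps)))
    for i, tick in enumerate(wavefront_schedule(blocks, steps)):
        print(f"tick {i}: {tick}")
```

With 4 blocks and 4 denoising steps, running the tasks one by one takes 16 steps, while the wavefront grouping needs 7 ticks. This toy count only shows why a step-first layout admits pipelining. It does not model real compute cost and is not meant to reproduce the reported 1.8x speedup.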