Core Viewpoint
- The article discusses the limitations of current high-quality video generation models, which can only produce videos of roughly 15 seconds, and the challenges creators face in realizing their visions given the need for segment-by-segment generation and cross-segment visual consistency [1][4].

Group 1: Limitations and Challenges
- The bottleneck in video length stems from a 60-second video decomposing internally into over 500,000 latent tokens, which makes maintaining narrative coherence and visual consistency difficult [2][3].
- The core contradiction of autoregressive video generation models is the trade-off between the longer context needed for coherence and the computational cost that longer context incurs [4][5].
- Existing compression methods often sacrifice the high-frequency details crucial for visual realism and consistency, which remains a major obstacle in long-video generation [6].

Group 2: Proposed Solutions
- A research team led by Lvmin Zhang (Soochow University and Stanford University) proposes a new memory compression system designed specifically for long videos, aiming to retain fine visual details during compression [6][7].
- The proposed neural network can compress a 20-second video into a context of roughly 5,000 tokens while maintaining good perceptual quality [8].

Group 3: Methodology
- The research employs a two-stage strategy, first pre-training a dedicated memory compression model that preserves high-fidelity frame-level detail at any position in the history [11][15].
- The pre-training objective minimizes the feature distance for frames randomly sampled from the compressed history, ensuring robust detail encoding across the entire sequence [12][16].
- The architecture uses a lightweight dual-path structure that processes a low-resolution video stream alongside high-resolution residual information, improving detail fidelity [12][23].
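The token counts quoted above imply a large context reduction. A back-of-envelope sketch, using only the article's figures (~500,000 latent tokens per 60 seconds uncompressed, ~5,000 tokens per 20 seconds compressed); the derived per-second rates and compression factor are illustrative arithmetic, not numbers from the paper:

```python
# Back-of-envelope token budget using the figures quoted above.
# ~500,000 latent tokens for 60 s before compression,
# ~5,000 tokens for 20 s after compression.

def tokens_per_second(total_tokens: float, seconds: float) -> float:
    """Average latent-token rate of a video representation."""
    return total_tokens / seconds

baseline_rate = tokens_per_second(500_000, 60)    # uncompressed rate
compressed_rate = tokens_per_second(5_000, 20)    # compressed rate

# Implied compression factor for a 20-second clip
uncompressed_20s = baseline_rate * 20
compression_factor = uncompressed_20s / 5_000

print(round(baseline_rate), compressed_rate, round(compression_factor, 1))
# → 8333 250.0 33.3
```

On these numbers, the compressed representation is roughly 33× smaller than what an uncompressed 20-second history would require.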
Group 4: Experimental Results
- Pre-training ran on an 8 × H100 GPU cluster; the model handles diverse prompts and maintains consistency across characters, scenes, objects, and plotlines [30][34].
- Quantitative evaluations show competitive scores on various consistency metrics, with the Wan+Qwen combination leading on instance scores [35][36].
- Ablation studies indicate the proposed method outperforms alternatives on PSNR and SSIM, preserving the original image structure even at high compression rates [37][38].
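The ablations above report PSNR, a standard fidelity metric. A minimal sketch of how PSNR is computed from mean squared error; the 8-pixel patches below are made-up example data, not from the paper:

```python
import math

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")  # identical inputs
    return 10.0 * math.log10(max_val ** 2 / mse)

# Hypothetical 8-pixel patches; small reconstruction errors give a high PSNR
ref   = [52, 55, 61, 59, 79, 61, 76, 41]
recon = [54, 55, 60, 59, 78, 62, 76, 43]
print(round(psnr(ref, recon), 2))
```

Higher PSNR means lower reconstruction error, which is why preserving high-frequency residuals under compression shows up directly in this metric.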
ControlNet author Lvmin Zhang's latest paper: ultra-short context for long videos
机器之心 (Synced) · 2026-01-03 07:00