LINVIDEO
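A 10-second video exceeds 50,000 tokens and O(n²) won't run? A post-training linearization framework delivers a 1.71× speedup and sharply lower inference cost | CVPR'2026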
量子位· 2026-03-10 02:13
Core Viewpoint
- The article covers the challenges and recent progress in video diffusion models, focusing on LINVIDEO, a framework that linearizes a large share of a video diffusion model's layers without requiring training data or retraining from scratch, while maintaining generation quality [3][25].

Group 1: Challenges in Video Diffusion Models
- Video generation has entered the era of large-scale models, and computational costs have risen sharply with it [1].
- Self-attention costs O(n²) in the number of tokens n, and a 10-second video can exceed 50,000 tokens, so running it efficiently is difficult [2].
- Linearizing a video diffusion model is hard because the replacement is sensitive to which layers are swapped: different layers affect final generation quality to different degrees [7].

Group 2: LINVIDEO Framework
- LINVIDEO is a post-training framework that linearizes a high proportion of a video diffusion model's layers while preserving generation quality [3][6].
- It uses selective transfer, which treats layer selection as a binary decision problem so that the model learns which layers can be safely linearized [15][25].
- It also introduces anytime distribution matching (ADM), which aligns sample distributions at any timestep and improves efficiency and effectiveness without auxiliary models [15][25].

Group 3: Experimental Results
- LINVIDEO delivered a 1.71× end-to-end speedup on the Wan 14B model; combined with 4-step distillation, it reached up to 20.9× while keeping video quality nearly unchanged [6][19].
- In the comparison with other methods, LINVIDEO recorded a latency of 68.26 seconds (1.43× speedup) on the Wan 1.3B model and 1127 seconds (1.71× speedup) on the Wan 14B model [17][19].
- Overall, LINVIDEO provides a practical solution for the linearization of video diffusion models, moving from O(n²) to a more scalable O(n) inference path [25].
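
To make the O(n²) versus O(n) contrast concrete, here is a minimal sketch of generic kernelized (linear) attention in plain Python. It is not LINVIDEO's implementation: the feature map `phi` (elu(x) + 1) and all function names are assumptions chosen for illustration. Both functions compute the same output; the first builds the full n × n score matrix, while the second reassociates the product as φ(Q)(φ(K)ᵀV), so the cost grows linearly with the token count n. Softmax attention cannot be reordered this way, because the softmax is applied to the n × n matrix itself.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def phi(m):
    """Positive feature map elu(x) + 1 (an assumed choice, not taken from the article)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]

def kernel_attention_quadratic(q, k, v):
    """Build the n x n score matrix first: O(n^2 * d) time, O(n^2) memory."""
    scores = matmul(phi(q), transpose(phi(k)))   # n x n
    out = matmul(scores, v)                      # n x d
    # Normalize each output row by the sum of that row's scores.
    return [[x / sum(s) for x in row] for row, s in zip(out, scores)]

def kernel_attention_linear(q, k, v):
    """Same result, reassociated as phi(Q) (phi(K)^T V): O(n * d^2) time."""
    fq, fk = phi(q), phi(k)
    kv = matmul(transpose(fk), v)                # d x d, size independent of n
    k_sum = [sum(col) for col in zip(*fk)]       # column sums of phi(K), length d
    out = matmul(fq, kv)                         # n x d
    norms = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / z for x in row] for row, z in zip(out, norms)]
```

With n above 50,000 and a per-head dimension d in the tens or low hundreds, the n × n matrix in the first function dominates the cost, whereas the second only ever materializes d × d and n × d arrays.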
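
The selective transfer idea in Group 2 (layer selection as a learned binary decision) can be pictured with a small sketch. The summary does not say how the decision is parameterized, so everything here is an assumption: a hypothetical per-layer score in [0, 1] blends the quadratic and linear attention outputs during post-training, and the scores are thresholded into hard choices afterwards. The score values and the 0.5 threshold are illustrative only.

```python
def mix_layer_output(score, softmax_out, linear_out):
    """Blend two attention outputs: score=1 keeps softmax attention, score=0 uses linear."""
    return [score * s + (1.0 - score) * l for s, l in zip(softmax_out, linear_out)]

def binarize(scores, threshold=0.5):
    """Turn per-layer scores into hard choices: True = keep quadratic, False = linearize."""
    return [s >= threshold for s in scores]

def linearized_fraction(decisions):
    """Share of layers that end up using linear attention."""
    return sum(1 for keep in decisions if not keep) / len(decisions)

# Hypothetical scores for a 4-layer model (not values from the article).
layer_scores = [0.93, 0.08, 0.41, 0.02]
decisions = binarize(layer_scores)
fraction = linearized_fraction(decisions)
```

In this toy setup only the first layer keeps quadratic attention, which mirrors the article's point that layers differ in how safely they can be replaced.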