Overturning the Sora Myth Overnight: A Single H200 Produces a Clip in 5 Seconds, and an All-Chinese Team's Open-Source AI Sets the Video Community Ablaze
36Kr·2025-08-07 07:29

Core Insights
- The article introduces a new training recipe called "sparse distillation" that speeds up video denoising by 70x, bringing AI video generation close to real time [2][8].

Group 1: Technology and Innovation
- The FastWan2.1-1.3B model completes video denoising in just 1 second on a single H200, generating a 480p video end to end in 5 seconds [2].
- The upgraded FastWan2.2-5B generates a 5-second 720p video in only 16 seconds on a single H200 [6].
- Sparse distillation combines sparse attention with denoising-step distillation, cutting the number of denoising steps from 50 to as few as 3 while maintaining performance [11][14] (see the step-count sketch after Group 4).

Group 2: Performance Metrics
- Traditional video diffusion models require many denoising steps, and their attention mechanisms dominate the computational cost, consuming over 85% of inference time [9].
- The new video sparse attention (VSA) mechanism dynamically identifies the key tokens in a sequence, improving efficiency without sacrificing quality [13] (a block-sparse sketch follows after Group 4).
- VSA cut attention FLOPS by 8x while matching the training loss of full attention [19].

Group 3: Model Architecture
- The architecture has three components: a VSA-driven sparse student network, a frozen real-score network, and a trainable fake-score network [14][17].
- During training, the student generates outputs that both score networks evaluate; the discrepancy between the two scores supplies the gradient that optimizes the student [17] (a simplified training step is sketched after Group 4).
- Training efficiency is further improved by sharding parameters across GPUs and by activation checkpointing to keep memory usage in check [18] (see the final sketch).

Group 4: Results and Comparisons
- VSA outperformed earlier sparse-attention methods, holding up even under extreme sparsity [23].
- Inference time for the Wan-1.3B model dropped from 31 seconds with full attention to 18 seconds with VSA [23].
- On long sequences, VSA achieved a 6x speedup over FlashAttention-3 [25].
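
To make the step-distillation claim concrete, here is a minimal sketch contrasting a 50-step denoising loop with the 3-step loop a distilled student permits. This is not FastWan code: the denoiser, latent shape, and timestep schedules are all placeholders.

```python
import torch

def sample(model, latents, timesteps):
    """Generic iterative denoising: each step refines the latent.
    `model` stands in for the video diffusion transformer."""
    for t in timesteps:
        latents = model(latents, t)
    return latents

# Stand-in denoiser so the sketch runs; the real model is a video DiT.
model = lambda x, t: 0.98 * x

latents = torch.randn(1, 16, 8, 60, 104)           # hypothetical video latent

teacher_steps = torch.linspace(999, 0, 50).long()  # teacher: 50 forward passes
student_steps = torch.tensor([999, 500, 0])        # distilled student: only 3

_ = sample(model, latents, teacher_steps)
_ = sample(model, latents.clone(), student_steps)
```

Cutting 50 passes to 3 alone buys roughly a 16x reduction in denoiser calls; the remaining speedup toward the claimed 70x comes from making each pass cheaper with sparse attention.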
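The article does not spell out VSA's internals, so the following is a rough sketch of the coarse-to-fine, block-sparse pattern it describes; the block size and top-k are made-up values, not the real kernel's parameters. Tokens are grouped into blocks, a cheap block-level pass over pooled summaries picks the most relevant key blocks per query block, and exact attention runs only inside the selected blocks.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, topk=8):
    """Coarse-to-fine sparse attention sketch (hypothetical, not the VSA kernel).
    q, k, v: (seq, dim) with seq divisible by `block`."""
    seq, dim = q.shape
    nb = seq // block
    qb = q.view(nb, block, dim)
    kb = k.view(nb, block, dim)
    vb = v.view(nb, block, dim)

    # Coarse stage: block-level relevance from mean-pooled summaries.
    q_pool = qb.mean(dim=1)                   # (nb, dim)
    k_pool = kb.mean(dim=1)                   # (nb, dim)
    scores = q_pool @ k_pool.T / dim ** 0.5   # (nb, nb)
    keep = scores.topk(topk, dim=-1).indices  # top-k key blocks per query block

    # Fine stage: exact attention restricted to the selected key blocks.
    out = torch.empty_like(qb)
    for i in range(nb):
        ks = kb[keep[i]].reshape(-1, dim)     # (topk*block, dim)
        vs = vb[keep[i]].reshape(-1, dim)
        att = F.softmax(qb[i] @ ks.T / dim ** 0.5, dim=-1)
        out[i] = att @ vs
    return out.view(seq, dim)

q = k = v = torch.randn(4096, 128)
out = block_sparse_attention(q, k, v)  # fine stage costs 8/64 = 1/8 of full attention
```

With 8 of 64 blocks kept, the fine stage does one eighth of the full-attention work, consistent with the article's 8x FLOPS reduction; because the selection is data-dependent, the important tokens are identified dynamically rather than by a fixed pattern.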
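The three-network setup the article describes (sparse student, frozen real-score network, trainable fake-score network) matches the distribution-matching-distillation pattern. Below is a heavily simplified sketch of one training step under that reading; the linear layers, noise level, and learning rates are stand-ins, not FastWan's configuration.

```python
import torch
import torch.nn as nn

dim = 256
student = nn.Linear(dim, dim)       # VSA-driven few-step generator (stand-in)
real_score = nn.Linear(dim, dim).requires_grad_(False)  # frozen teacher score
fake_score = nn.Linear(dim, dim)    # trainable, tracks the student's distribution

opt_student = torch.optim.AdamW(student.parameters(), lr=1e-4)
opt_fake = torch.optim.AdamW(fake_score.parameters(), lr=1e-4)

noise = torch.randn(8, dim)
x = student(noise)                  # student generates in a few steps

# Student update: the disagreement between the frozen real score and the
# fake score on the student's samples is the distribution-matching gradient.
with torch.no_grad():
    grad = fake_score(x) - real_score(x)
loss_student = (x * grad).mean()    # surrogate whose gradient w.r.t. x ~ `grad`
opt_student.zero_grad(); loss_student.backward(); opt_student.step()

# Fake-score update: a standard denoising loss on the student's (detached)
# samples keeps it modeling the current student distribution.
x_d = x.detach()
noisy = x_d + 0.1 * torch.randn_like(x_d)
loss_fake = ((fake_score(noisy) - x_d) ** 2).mean()
opt_fake.zero_grad(); loss_fake.backward(); opt_fake.step()
```

The frozen network never trains, the fake-score network chases the student, and the student moves its samples toward regions where the two scores agree, which is what lets the 3-step student inherit the 50-step teacher's quality.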
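The article only names the two memory techniques: parameter sharding across GPUs and activation checkpointing. A minimal PyTorch expression of both, with a toy block standing in for a video transformer layer, might look like this:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Toy feed-forward block standing in for a video DiT layer."""
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Activation checkpointing: recompute this block during backward
        # instead of storing its intermediate activations.
        return x + checkpoint(self.ff, x, use_reentrant=False)

model = nn.Sequential(*[Block() for _ in range(24)])

# Parameter sharding: FSDP splits each block's weights across the GPUs in
# the process group (requires torch.distributed to be initialized first).
# sharded = FSDP(model)
```

Checkpointing trades compute for memory, while FSDP keeps only a shard of each layer's parameters resident per GPU; together they let long-video activations and model weights fit on commodity multi-GPU nodes.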