Generating long videos at the cost of short ones: ByteDance Seed's new attention mechanism cuts computation by 85%
量子位·2025-09-02 04:17

Core Viewpoint
- The article covers a new model from ByteDance's Seed team, developed in collaboration with Stanford researchers, that sharply reduces the computational cost of generating long videos while preserving quality and coherence [1][2].

Group 1: Cost Reduction in Video Generation
- The model generates long videos at a cost comparable to that of short videos, cutting computation by roughly 85% [1][10].
- For example, generating a one-minute 480P video with the Mixture of Contexts (MoC) mechanism takes only 2.32×10¹² FLOPs, versus 1.66×10¹³ FLOPs for the baseline model [10].
- The savings carry over to multi-shot generation: a 64-second multi-shot video needs 2.3×10¹² FLOPs against 1.7×10¹³ FLOPs for the baseline, about 86% less compute [11] (a quick arithmetic check of these figures appears after this summary).

Group 2: Quality and Consistency
- The generated long videos maintain subject and background consistency, motion smoothness, and overall image quality, outperforming the baseline model across the reported metrics [12].
- In a single-shot 8-second 320×192 test, MoC still cuts the computational load by roughly 78%, needing only 4.1×10⁹ FLOPs compared to 1.9×10¹⁰ FLOPs for the baseline [14].

Group 3: Mechanism of MoC
- MoC reframes long video generation as an information retrieval task centered on efficient cross-temporal memory retrieval [3][15].
- It uses a sparse attention mechanism that segments the video sequence into semantically homogeneous content blocks, so each query token connects only to the most relevant blocks [15][16] (see the routing sketch below).
- A "content-aligned chunking" step improves retrieval accuracy and avoids unnecessary computation [19].

Group 4: Engineering Implementation
- To keep generation causal, MoC enforces strict temporal masks during the routing phase, ensuring that queries never access future blocks [20].
- The implementation relies on FlashAttention for efficient memory access and parallelism on GPUs, scaling to sequences of millions of tokens [20] (a portable attention sketch follows the routing example below).
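As a quick sanity check, the quoted percentages follow directly from the FLOP counts above; the snippet below only reproduces that arithmetic, with the figures copied from the article.

```python
# Savings implied by the FLOP counts quoted in the article (MoC vs. baseline).
cases = {
    "1-minute 480P video":        (2.32e12, 1.66e13),
    "64-second multi-shot video": (2.3e12, 1.7e13),
    "8-second 320x192 video":     (4.1e9, 1.9e10),
}

for name, (moc, baseline) in cases.items():
    print(f"{name}: {1 - moc / baseline:.0%} fewer FLOPs than the baseline")

# Prints roughly 86%, 86%, 78% -- consistent with the ~85%, ~86%, and ~78%
# reductions reported above.
```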
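The article describes MoC only at a high level. The following is a minimal PyTorch sketch of what content-based block routing with a strict temporal mask could look like: fixed-size chunks stand in for the paper's content-aligned chunking, and the mean-pooled chunk descriptor, dot-product scoring, and `top_k` value are illustrative assumptions, not the published design.

```python
import torch
import torch.nn.functional as F

def route_queries_to_chunks(q, k, chunk_len=64, top_k=4):
    """Pick, for every query token, the top-k key chunks it may attend to,
    never allowing a chunk that starts after the query's own position.

    Assumes self-attention, i.e. query i and key i share temporal position i.
    q, k: (T, d) tensors in temporal order. Returns (T, top_k) chunk ids.
    """
    T, d = k.shape
    num_chunks = (T + chunk_len - 1) // chunk_len

    # Pad so the sequence splits evenly, then mean-pool each chunk into one
    # descriptor vector (the paper uses content-aligned chunks; fixed-size
    # chunks and mean pooling are simplifications for this sketch).
    pad = num_chunks * chunk_len - T
    k_padded = F.pad(k, (0, 0, 0, pad))
    chunk_desc = k_padded.view(num_chunks, chunk_len, d).mean(dim=1)   # (C, d)

    # Relevance of every query to every chunk descriptor.
    scores = (q @ chunk_desc.T) / d ** 0.5                             # (T, C)

    # Strict temporal mask applied during routing: chunks whose first token
    # lies in the query's future are never selectable.
    q_pos = torch.arange(T).unsqueeze(1)                               # (T, 1)
    chunk_start = torch.arange(num_chunks).unsqueeze(0) * chunk_len    # (1, C)
    scores = scores.masked_fill(chunk_start > q_pos, float("-inf"))

    # Each query keeps only its top-k chunks (early queries may have fewer
    # admissible chunks than top_k; a real implementation would drop the
    # -inf entries instead of returning them).
    return scores.topk(k=min(top_k, num_chunks), dim=-1).indices
```

With, say, T = 4096 tokens and chunk_len = 64, each query attends to at most 4 × 64 = 256 keys instead of all 4096, which is where the reported FLOP savings would come from.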
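The article states that the kernels are built on FlashAttention. As a portable stand-in, the sketch below gathers each query's selected chunks and runs `torch.nn.functional.scaled_dot_product_attention` over them, which can dispatch to a FlashAttention-style fused kernel on supported GPUs; the per-query Python loop is for readability only and is nowhere near a production kernel.

```python
import torch
import torch.nn.functional as F

def sparse_block_attention(q, k, v, selected, chunk_len=64):
    """Attend each query only to the key/value chunks chosen by the router.

    q, k, v: (T, d); selected: (T, top_k) chunk ids from the routing step.
    A real kernel would batch the variable-length gathers instead of looping.
    """
    T, d = q.shape
    out = torch.empty_like(q)

    for i in range(T):
        # Tokens belonging to this query's selected chunks; drop any routing
        # entries that point past the query (guards against placeholder picks
        # from the router sketch above).
        ids = selected[i].unique()
        ids = ids[ids * chunk_len <= i]
        token_idx = torch.cat([
            torch.arange(c * chunk_len, min((c + 1) * chunk_len, T))
            for c in ids.tolist()
        ])
        k_sel, v_sel = k[token_idx], v[token_idx]

        # Fused attention over just the selected tokens; on a supported GPU
        # this call can use a FlashAttention-style kernel under the hood.
        out[i] = F.scaled_dot_product_attention(
            q[i].view(1, 1, 1, d),            # (batch, heads, len, dim)
            k_sel.view(1, 1, -1, d),
            v_sel.view(1, 1, -1, d),
        ).view(d)
    return out
```

Chaining the two sketches on a (T, d) tensor x: `out = sparse_block_attention(x, x, x, route_queries_to_chunks(x, x))`.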