Generating Long Videos at Short-Video Cost: ByteDance Seed's New Attention Mechanism Cuts Computation by 85%

Core Insights

- ByteDance Seed, in collaboration with Stanford researchers, has introduced a model that cuts the computational cost of generating long videos by roughly 85% while keeping characters and scenes coherent [1][3].

Group 1: Technology Overview

- The model uses a sparse attention mechanism called Mixture of Contexts (MoC), which recasts long video generation as a context retrieval task [1][3].
- With MoC, generating a one-minute 480P video takes 2.32×10¹² FLOPs, versus 1.66×10¹³ FLOPs for the baseline, a reduction of roughly 85% [3].
- The savings also hold in the multi-shot setting: a 64-second multi-shot 480P video takes only 2.3×10¹² FLOPs, about 86% less than the baseline [3].

Group 2: Mechanism Details

- MoC first segments the cross-modal sequence into semantically homogeneous content blocks, which improves retrieval accuracy and avoids unnecessary computation [4][6].
- A dynamic top-k routing step then keeps only the most relevant blocks for each query to attend to. The routing adds no parameters to the model [6][7].
- To prevent information leakage and keep long-range dynamics smooth, a strict temporal mask is applied during routing: a query may not select its own block or any later block [6][7].

Group 3: Performance Metrics

- MoC outperforms the baseline on theme consistency, background coherence, action continuity, and image quality [3][4].
- In a single-shot 8-second 320×192 video test, MoC needed 4.1×10⁹ FLOPs, about 78% less than the baseline's 1.9×10¹⁰ FLOPs [3].

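The routing step described under Mechanism Details can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, and scoring each block by the dot product between the query and a mean-pooled block descriptor is an assumption chosen because it needs no learned parameters, which is all the summary above states. The temporal mask is expressed by only scoring blocks that come strictly before the query's own block.

```python
def mean_pool(block):
    """Average the key vectors of one block into a single descriptor vector."""
    dim = len(block[0])
    return [sum(vec[d] for vec in block) / len(block) for d in range(dim)]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_top_k(query, key_blocks, query_block, k):
    """Pick the k most relevant earlier blocks for one query vector.

    Assumption (not from the source): relevance is the dot product between
    the query and the mean-pooled keys of each block.
    Blocks with index >= query_block are never scored, which implements the
    strict temporal mask (no access to the query's own or later blocks).
    """
    scored = [
        (dot(query, mean_pool(key_blocks[j])), j)
        for j in range(query_block)  # temporal mask: earlier blocks only
    ]
    # Highest score first; ties broken by the earlier block index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return sorted(j for _, j in scored[:k])


# Toy example: five blocks of 2-dimensional keys, query sits in block 3.
key_blocks = [
    [[1.0, 0.0], [1.0, 0.0]],  # block 0
    [[0.0, 1.0], [0.0, 1.0]],  # block 1
    [[0.7, 0.7], [0.7, 0.7]],  # block 2
    [[1.0, 0.0], [1.0, 0.0]],  # block 3 (the query's own block, masked)
    [[1.0, 0.0], [1.0, 0.0]],  # block 4 (a later block, masked)
]
query = [1.0, 0.2]
selected = route_top_k(query, key_blocks, query_block=3, k=2)
print(selected)
```

In this toy setup blocks 3 and 4 would score as well as block 0, but the mask removes them from consideration, so only earlier context can be selected. Attention is then computed over the selected blocks alone, which is where the compute savings come from.
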
Group 4: Engineering Implementation

- MoC feeds the selected keys and values into FlashAttention variable-length kernels, which allows linear scaling to millions of tokens and efficient parallel processing on GPUs [6][7].
- Every visual token can still attend to the complete text prompt, which maintains thematic consistency and improves editability [7].
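To make the variable-length idea concrete, here is a minimal sketch of how routed blocks might be packed before being handed to such a kernel. It rests on several assumptions: variable-length attention kernels commonly take one concatenated token buffer plus cumulative offsets (called `cu_seqlens` in the flash-attn package), the helper `pack_selected` is hypothetical, and string tokens stand in for key/value rows. The text-prompt block is always added to each selection, mirroring the statement that every visual token can access the full prompt.

```python
from itertools import accumulate


def pack_selected(key_blocks, selections, text_block=0):
    """Concatenate each query group's selected blocks into one packed buffer.

    key_blocks: list of blocks, each a list of tokens (stand-ins for K/V rows).
    selections: per query group, the block indices chosen by routing.
    text_block: index of the text-prompt block, always included (assumption).
    Returns (packed, cu_seqlens); group i owns packed[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    packed, lengths = [], []
    for chosen in selections:
        # Always add the text block, drop duplicates, keep temporal order.
        blocks = sorted(set(chosen) | {text_block})
        tokens = [tok for b in blocks for tok in key_blocks[b]]
        packed.extend(tokens)
        lengths.append(len(tokens))
    # Cumulative offsets: one more entry than there are groups.
    cu_seqlens = [0] + list(accumulate(lengths))
    return packed, cu_seqlens


# Toy example: block 0 is the text prompt, blocks 1-3 are video chunks.
key_blocks = [["t0", "t1"], ["a0", "a1", "a2"], ["b0"], ["c0", "c1"]]
selections = [[1], [1, 2], [3]]  # routing output for three query groups
packed, cu_seqlens = pack_selected(key_blocks, selections)
print(cu_seqlens)
```

Because each group contributes only its selected tokens, the packed buffer grows with the number of selected blocks rather than with the full sequence length, which is consistent with the linear scaling claimed above.
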