Mixture of Contexts (MoC)

Generating Long Videos at Short-Video Cost: ByteDance Seed's New Attention Mechanism Cuts Computation by 85%
Sohu Finance (搜狐财经) · 2025-09-02 05:45
Core Insights
- ByteDance Seed, in collaboration with Stanford researchers, has introduced a new model that reduces the computational cost of generating long videos by 85% while maintaining quality and the coherence of characters and scenes [1][3].

Group 1: Technology Overview
- The new model employs a sparse attention mechanism called Mixture of Contexts (MoC), which reframes long video generation as a context retrieval task [1][3].
- MoC generates a one-minute 480P video with only 2.32×10¹² FLOPs, versus 1.66×10¹³ FLOPs for the baseline model, an 85% reduction in computational load [3].
- MoC delivers similar savings for multi-shot generation: a 64-second multi-shot 480P video requires only 2.3×10¹² FLOPs, roughly 86% less than the baseline [3].

Group 2: Mechanism Details
- MoC's core mechanism segments the cross-modal sequence into semantically homogeneous content blocks, improving retrieval accuracy and cutting unnecessary computation [4][6].
- A dynamic top-k routing process retains only the most relevant blocks for attention, improving computational efficiency without adding parameters [6][7] (a minimal routing sketch follows this summary).
- To prevent information leakage and keep long-range dynamics smooth, strict temporal masks prohibit queries from accessing their own or subsequent blocks [6][7].

Group 3: Performance Metrics
- MoC outperforms baseline models across performance metrics, including subject consistency, background coherence, action continuity, and image quality [3][4].
- In a single-shot 8-second 320×192 video test, MoC required 4.1×10⁹ FLOPs, roughly 78% less than the baseline's 1.9×10¹⁰ FLOPs [3].

Group 4: Engineering Implementation
- MoC gathers the selected key-value pairs into FlashAttention variable-length kernels, enabling near-linear scaling to millions of tokens and efficient parallel processing on GPUs [6][7].
- All visual tokens retain access to the complete text prompt, maintaining thematic consistency and improving editability [7].
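To make the routing described in Group 2 concrete, here is a minimal, self-contained sketch of MoC-style dynamic top-k block routing with a strict temporal mask. It is not the official implementation: the block size, the value of top_k, the mean-pooled block descriptors, and the per-query Python loop are all illustrative assumptions (the real system fuses the selected blocks into FlashAttention kernels, as Group 4 notes).

```python
# Hedged sketch of MoC-style top-k block routing; shapes and defaults are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn.functional as F

def moc_topk_routing(q, k, v, block_size=64, top_k=4):
    """q, k, v: (seq_len, dim). Each query attends only to its top-k most
    relevant past blocks, under a strict temporal mask."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size  # assume divisibility for clarity

    # 1) Block descriptors: mean-pool the keys inside each block.
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, dim)
    descriptors = k_blocks.mean(dim=1)            # (n_blocks, dim)

    # 2) Router scores: similarity between each query and each descriptor.
    scores = q @ descriptors.T                    # (seq_len, n_blocks)

    # 3) Strict temporal mask: a query may not route to its own block or
    #    any later block, which prevents information leakage.
    q_block = torch.arange(seq_len) // block_size
    forbidden = torch.arange(n_blocks)[None, :] >= q_block[:, None]
    scores = scores.masked_fill(forbidden, float("-inf"))

    # 4) Keep only the top-k highest-scoring blocks per query; the other
    #    blocks are never attended to, which is where the savings come from.
    top = scores.topk(min(top_k, n_blocks), dim=-1).indices

    out = torch.zeros_like(q)
    for i in range(seq_len):  # per-query loop for clarity only
        blocks = top[i][scores[i, top[i]] > float("-inf")]
        if blocks.numel() == 0:  # earliest queries have no past blocks
            continue
        idx = (blocks[:, None] * block_size
               + torch.arange(block_size)[None, :]).reshape(-1)
        attn = F.softmax((q[i] @ k[idx].T) / dim ** 0.5, dim=-1)
        out[i] = attn @ v[idx]
    return out
```

Because the routing is a hard top-k over block descriptors rather than a learned gate, it adds no parameters, consistent with the description above.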
Generating Long Videos at Short-Video Cost: ByteDance Seed's New Attention Mechanism Cuts Computation by 85%
QbitAI (量子位) · 2025-09-02 04:17
Core Viewpoint
- The article discusses a new model developed by ByteDance Seed in collaboration with Stanford researchers that significantly reduces the computational cost of generating long videos while maintaining quality and coherence [1][2].

Group 1: Cost Reduction in Video Generation
- The new model generates long videos at a cost comparable to short videos, an 85% reduction in computational requirements [1][10].
- For example, generating a one-minute 480P video with the Mixture of Contexts (MoC) mechanism requires only 2.32×10¹² FLOPs, versus 1.66×10¹³ FLOPs for the baseline model [10].
- MoC shows similar savings on multi-shot videos: a 64-second multi-shot video needs 2.3×10¹² FLOPs versus 1.7×10¹³ FLOPs for the baseline, roughly 86% savings [11].

Group 2: Quality and Consistency
- The generated long videos maintain subject and background consistency, motion smoothness, and overall image quality, outperforming the baseline model across performance metrics [12].
- In a single-shot 8-second 320×192 video test, MoC cut the computational load by roughly 78%, requiring only 4.1×10⁹ FLOPs versus 1.9×10¹⁰ FLOPs for the baseline [14].

Group 3: Mechanism of MoC
- MoC reframes long video generation as an information retrieval task centered on efficient cross-temporal memory retrieval [3][15].
- Its sparse attention mechanism segments the video sequence into semantically homogeneous content blocks, so each query token connects only with the most relevant blocks [15][16].
- A "content-aligned chunking" step improves retrieval accuracy and reduces wasted computation [19]; see the chunking sketch after this summary.

Group 4: Engineering Implementation
- To prevent information leakage, strict temporal masks are enforced during the routing phase, ensuring that queries cannot access future blocks [20].
- The implementation uses FlashAttention's variable-length kernels for efficient memory access and parallel processing on GPUs, scaling to millions of tokens [20]; a sketch of this packing appears below.
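Group 3 names "content-aligned chunking" without giving the algorithm, so the sketch below shows one plausible reading under stated assumptions: a new chunk starts wherever the cosine similarity between adjacent token features drops, keeping each chunk semantically homogeneous. The threshold, minimum chunk length, and the function name content_aligned_chunks are all hypothetical, not from the paper.

```python
# Hedged sketch of content-aligned chunking over token features; the
# cut criterion and defaults are assumptions for illustration.
import torch
import torch.nn.functional as F

def content_aligned_chunks(tokens, sim_threshold=0.9, min_len=16):
    """tokens: (seq_len, dim) feature tensor. Starts a new chunk whenever
    the cosine similarity between adjacent tokens falls below
    sim_threshold, so each chunk stays semantically homogeneous."""
    normed = F.normalize(tokens, dim=-1)
    sim = (normed[:-1] * normed[1:]).sum(dim=-1)  # adjacent cosine similarity
    boundaries = [0]
    for t in range(1, tokens.shape[0]):
        # Cut only at genuine content changes, and keep chunks above a
        # minimum length so the router has meaningful units to score.
        if sim[t - 1] < sim_threshold and t - boundaries[-1] >= min_len:
            boundaries.append(t)
    boundaries.append(tokens.shape[0])
    # (start, end) spans become the key/value blocks the top-k router scores.
    return list(zip(boundaries[:-1], boundaries[1:]))
```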
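Both summaries state that the selected key/values are fed into FlashAttention's variable-length kernels. The sketch below shows one way such packing could look using flash_attn_varlen_func, a real entry point in the flash-attn 2.x package; the gather logic, the helper name varlen_routed_attention, and the assumption that every query routes to at least one chunk (plausible given the always-accessible text-prompt tokens mentioned above) are mine, not the paper's.

```python
# Hedged sketch: packing per-query block selections into one varlen call.
import torch
from flash_attn import flash_attn_varlen_func  # flash-attn 2.x

def varlen_routed_attention(q, k, v, chunk_spans, topk_idx):
    """q, k, v: (seq_len, n_heads, head_dim), fp16/bf16, on a CUDA device.
    chunk_spans: list of (start, end) chunk boundaries.
    topk_idx: (seq_len, top_k) long tensor of routed chunk ids per query.
    Treats each query as a length-1 sequence attending over the
    concatenation of its selected chunks, so a single varlen kernel call
    processes every query in parallel. Assumes each query routes to at
    least one chunk (e.g., a mandatory text-prompt chunk)."""
    seq_len = q.shape[0]
    kv_index, kv_lens = [], []
    for i in range(seq_len):
        spans = [chunk_spans[j] for j in topk_idx[i].tolist()]
        idx = torch.cat([torch.arange(s, e, device=q.device) for s, e in spans])
        kv_index.append(idx)
        kv_lens.append(idx.numel())
    kv_index = torch.cat(kv_index)

    # Cumulative sequence lengths in the int32 format the kernel expects.
    cu_q = torch.arange(seq_len + 1, dtype=torch.int32, device=q.device)
    cu_k = torch.zeros(seq_len + 1, dtype=torch.int32, device=q.device)
    cu_k[1:] = torch.tensor(kv_lens, dtype=torch.int32, device=q.device).cumsum(0)

    return flash_attn_varlen_func(
        q, k[kv_index], v[kv_index],
        cu_seqlens_q=cu_q, cu_seqlens_k=cu_k,
        max_seqlen_q=1, max_seqlen_k=max(kv_lens),
    )
```

Because the kernel processes all the packed length-1 query sequences in one launch, the cost scales with the number of selected tokens rather than the full sequence length, which is consistent with the near-linear scaling both summaries report.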