Long Video Generation
Long video generation can now look back: Oxford proposes "memory-enhanced stabilization" (记忆增稳) with a 12x speedup
36Kr · 2025-09-05 08:41
Core Insights
- VMem introduces a memory indexing system based on 3D geometry, replacing the traditional short-context approach and allowing faster rendering and improved consistency in video generation models [1][5][15]
- The system renders at 4.2 seconds per frame, approximately 12 times faster than conventional methods that use a context window of 21 frames [1][17]

Memory Indexing
- The memory system uses surfels (surface elements) to index previously generated views, recording which frames have observed each surfel [2][10]
- When a new viewpoint is introduced, the system retrieves the most frequently referenced frames for rendering, with explicit occlusion modeling improving reliability [3][10]; a minimal sketch of this retrieval follows the summary below

Performance Metrics
- VMem performs notably well on long-sequence revisits, particularly in the proposed cycle-trajectory evaluation, remaining stable when returning to a previously visited location [3][15]
- In comparative evaluations, VMem outperforms existing models such as LookOut and GenWarp across metrics including PSNR and LPIPS, indicating better visual consistency [15][16]

Integration and Usability
- The memory module can be integrated into existing image generation frameworks, maintaining performance while reducing the context from K=17 to K=4 frames [4][10]
- VMem serves as an external memory system, enabling efficient retrieval of relevant frames based on geometric visibility and reducing computational overhead [10][11]

Experimental Results
- Evaluations generated images along ground-truth camera trajectories, with VMem showing consistent performance over long sequences [12][18]
- Results indicate that VMem maintains long-term consistency and effectively decouples memory capacity from the number of generation steps [16][18]

Limitations and Future Directions
- The current implementation is not real-time, since diffusion sampling requires multiple steps; acceleration is possible through faster models and more computational power [18]
- Generalization to natural landscapes and dynamic objects remains open for exploration, as the current fine-tuning focuses primarily on indoor environments [18]
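The retrieval rule described above boils down to a frequency vote over surfel observations. Below is a minimal sketch of such a surfel-indexed view memory; the class and method names (`SurfelMemory`, `observe`, `retrieve`) are illustrative assumptions, and the explicit occlusion check VMem performs is omitted for brevity:

```python
from collections import Counter, defaultdict

class SurfelMemory:
    """Toy surfel-indexed view memory: each surfel records the frames
    that observed it; retrieval returns the frames most frequently
    seen from the surfels visible at a new camera pose."""

    def __init__(self):
        # surfel id -> list of frame ids that observed this surfel
        self.observers = defaultdict(list)

    def observe(self, frame_id, surfel_ids):
        """Register a generated frame against the surfels it sees."""
        for s in surfel_ids:
            self.observers[s].append(frame_id)

    def retrieve(self, visible_surfels, k=4):
        """Vote for the k most relevant past frames (K=4 matches the
        reduced context size reported above)."""
        votes = Counter()
        for s in visible_surfels:
            votes.update(self.observers.get(s, []))
        return [frame for frame, _ in votes.most_common(k)]

# Usage: index two frames, then query a pose that re-sees surfels 7 and 9.
mem = SurfelMemory()
mem.observe(frame_id=0, surfel_ids=[3, 7, 9])
mem.observe(frame_id=1, surfel_ids=[7, 12])
print(mem.retrieve(visible_surfels=[7, 9], k=4))  # -> [0, 1]
```

Retrieval cost here scales with how many surfels the new pose sees rather than with how many frames have been generated so far, which is what lets memory capacity decouple from sequence length.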
Generating long videos at short-video cost: ByteDance Seed's new attention mechanism cuts computation by 85%
Sohu Finance · 2025-09-02 05:45
Core Insights
- ByteDance Seed, in collaboration with Stanford researchers, has introduced a new model that reduces the computational cost of generating long videos by 85% while maintaining quality and the coherence of characters and scenes [1][3]

Group 1: Technology Overview
- The model employs a sparse attention mechanism called Mixture of Contexts (MoC), which reframes long video generation as a context retrieval task [1][3]
- MoC generates a one-minute 480P video with only 2.32×10¹² FLOPs, versus 1.66×10¹³ FLOPs for the baseline model, an 85% reduction in computational load [3]
- The savings extend to shorter footage: a multi-shot 64-second 480P video requires only 2.3×10¹² FLOPs, roughly 86% less than the baseline [3]

Group 2: Mechanism Details
- MoC's core mechanism segments cross-modal sequences into semantically homogeneous content blocks, improving retrieval accuracy and cutting unnecessary computation [4][6]
- A dynamic top-k routing process retains only the most relevant blocks for attention, improving computational efficiency without adding parameters [6][7]; see the sketch after this summary
- To prevent information leakage and keep long-range dynamics smooth, strict temporal masks prohibit queries from attending to their own or subsequent blocks [6][7]

Group 3: Performance Metrics
- MoC outperforms baseline models across performance metrics including subject consistency, background coherence, action continuity, and image quality [3][4]
- In a single-shot 8-second 320×192 video test, MoC required 4.1×10⁹ FLOPs, a reduction of approximately 78% from the baseline's 1.9×10¹⁰ FLOPs [3]

Group 4: Engineering Implementation
- MoC packs the selected key-value pairs into FlashAttention variable-length kernels, enabling linear scalability to millions of tokens and efficient parallel processing on GPUs [6][7]
- All visual tokens retain access to the complete text prompt, maintaining thematic consistency and enhancing editability [7]
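A minimal sketch of the routing-plus-mask step in PyTorch, assuming mean-pooled block keys as the routing signal; the function name, block size, and pooling choice are illustrative assumptions rather than details from the paper:

```python
import torch
import torch.nn.functional as F

def moc_route_and_attend(q, k, v, block_size=64, top_k=2):
    """Sketch of Mixture-of-Contexts routing for one attention head:
    pool keys per block, score each query against the block summaries,
    keep only the top-k strictly-past blocks, attend over their tokens.

    q, k, v: (seq_len, dim) tensors."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size
    k_blocks = k[: n_blocks * block_size].reshape(n_blocks, block_size, dim)
    block_keys = k_blocks.mean(dim=1)                  # (n_blocks, dim)

    scores = q @ block_keys.T                          # (seq_len, n_blocks)
    # Strict temporal mask: a query may not route to its own block
    # or to any later block.
    q_block = torch.arange(seq_len) // block_size
    banned = torch.arange(n_blocks)[None, :] >= q_block[:, None]
    scores = scores.masked_fill(banned, float("-inf"))

    # Dynamic top-k routing: keep only the best-scoring past blocks.
    sel = scores.topk(min(top_k, n_blocks), dim=-1).indices

    out = torch.zeros_like(q)
    for i in range(seq_len):
        valid = sel[i][scores[i, sel[i]] > float("-inf")].tolist()
        if not valid:                    # queries in block 0 have no past
            continue
        idx = torch.cat([torch.arange(b * block_size, (b + 1) * block_size)
                         for b in valid])
        attn = F.softmax(q[i] @ k[idx].T / dim ** 0.5, dim=-1)
        out[i] = attn @ v[idx]
    return out

# Smoke test: 256 tokens, 32-dim head -> output keeps the input shape.
q = k = v = torch.randn(256, 32)
print(moc_route_and_attend(q, k, v).shape)  # torch.Size([256, 32])
```

The per-query Python loop is only for clarity; per Group 4, the reported implementation instead packs the selected key-value blocks into FlashAttention variable-length kernels so the sparse attention runs as batched GPU calls.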
Generating long videos at short-video cost: ByteDance Seed's new attention mechanism cuts computation by 85%
QbitAI (量子位) · 2025-09-02 04:17
Core Viewpoint
- The article discusses a new model developed by ByteDance Seed in collaboration with Stanford researchers that significantly reduces the computational cost of generating long videos while maintaining quality and coherence [1][2]

Group 1: Cost Reduction in Video Generation
- The new model generates long videos at a cost comparable to that of short videos, an 85% reduction in computational requirements [1][10]
- For example, generating a one-minute 480P video with the Mixture of Contexts (MoC) mechanism requires only 2.32×10¹² FLOPs, compared with 1.66×10¹³ FLOPs for the baseline model [10]
- MoC delivers similar savings for short videos: a 64-second multi-shot video requires 2.3×10¹² FLOPs versus 1.7×10¹³ FLOPs for the baseline, approximately 86% less [11]

Group 2: Quality and Consistency
- The generated long videos maintain subject and background consistency, motion smoothness, and overall image quality, outperforming the baseline model across performance metrics [12]
- In a single-shot 8-second 320×192 video test, the MoC model cut the computational load by approximately 78%, requiring only 4.1×10⁹ FLOPs compared with 1.9×10¹⁰ FLOPs for the baseline [14]

Group 3: Mechanism of MoC
- MoC reframes long video generation as an information retrieval task centered on efficient cross-temporal memory retrieval [3][15]
- Its sparse attention mechanism segments video sequences into semantically homogeneous content blocks, so each query token connects only with the most relevant blocks [15][16]
- A "content-aligned chunking" process improves retrieval accuracy and reduces unnecessary computational waste [19]; a toy sketch follows this summary

Group 4: Engineering Implementation
- Strict temporal masks enforced during the routing phase prevent information leakage by ensuring that queries never access future blocks [20]
- The implementation uses FlashAttention for efficient memory access and parallel processing on GPUs, scaling to millions of tokens [20]
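The "content-aligned chunking" step can be pictured as cutting the token sequence wherever neighbouring features stop resembling each other, so each block stays semantically homogeneous. A toy sketch, in which the cosine-similarity criterion, the threshold value, and the function name are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def content_aligned_chunks(feats, sim_threshold=0.8):
    """Toy content-aligned chunking: start a new block whenever the
    cosine similarity between consecutive token features drops below
    the threshold.

    feats: (seq_len, dim) token features.
    Returns a list of (start, end) index pairs, end exclusive."""
    f = F.normalize(feats, dim=-1)
    sim = (f[:-1] * f[1:]).sum(dim=-1)       # neighbour cosine similarity
    cuts = ((sim < sim_threshold).nonzero().flatten() + 1).tolist()
    bounds = [0, *cuts, feats.shape[0]]
    return list(zip(bounds[:-1], bounds[1:]))

# Usage: two runs of repeated (hence homogeneous) feature vectors
# should come back as two blocks.
feats = torch.cat([torch.randn(1, 16).repeat(5, 1),
                   torch.randn(1, 16).repeat(3, 1)])
print(content_aligned_chunks(feats))  # e.g. [(0, 5), (5, 8)]
```

Blocks produced this way feed the top-k routing sketched under the previous article: the more homogeneous a block, the better its mean-pooled descriptor summarizes it, which is why content-aligned chunking improves retrieval accuracy.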