Interactive Video Generation
Context as Memory! HKU & Kuaishou propose a scene-consistent interactive video world model whose memory rivals Genie 3, and it arrived earlier!
量子位 (QbitAI) · 2025-08-21 07:15
Core Viewpoint
- The article discusses a new framework called "Context-as-Memory", developed by a research team from the University of Hong Kong and Kuaishou, which significantly improves scene consistency in interactive long video generation by efficiently utilizing historical context frames [8][10][19].

Summary by Sections

Introduction to Context-as-Memory
- The framework addresses scene inconsistency in AI-generated videos with a memory retrieval system that selects relevant historical frames to maintain continuity [10][19].

Types of Memory in Video Generation
- Two types of memory are identified: dynamic memory for short-term actions and behaviors, and static memory for scene-level and object-level information [12][13].

Key Concepts of Context-as-Memory
- Long video generation requires long-term historical memory to maintain scene consistency over time [15].
- Memory retrieval is crucial: directly conditioning on all historical frames is computationally expensive, so a memory retrieval module is needed to filter out the useful information [15].
- Context memory is formed by concatenating the selected context frames with the input, allowing the model to reference historical information while generating each frame [15][19].

Memory Retrieval Method
- The model employs a camera trajectory-based search that selects context frames whose visible area overlaps significantly with the current frame's, improving both computational efficiency and scene consistency (a code sketch of this idea follows this summary) [20][22].

Dataset and Experimental Results
- A dataset was created with Unreal Engine 5, containing 100 videos of 7601 frames each, to evaluate the effectiveness of the Context-as-Memory method [23].
- Experimental results show that Context-as-Memory outperforms baseline and state-of-the-art methods in both memory capability and generation quality, demonstrating its effectiveness at maintaining long-video consistency [24][25].

Generalization of the Method
- Generalization was tested by using images in various styles as initial frames, confirming strong memory capability in open-domain scenarios [26][27].

Research Team and Background
- The research was a collaboration between the University of Hong Kong, Zhejiang University, and Kuaishou, led by PhD student Yu Jiwen under Professor Liu Xihui [28][33].
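Neither summary includes reference code, but the camera-trajectory search described under "Memory Retrieval Method" can be sketched compactly: rank past frames by how plausibly their camera view overlaps the current one, using poses alone. The pose layout ([x, y, z, yaw]), the scoring weights, and all function names below are illustrative assumptions, not the authors' actual interface.

```python
import numpy as np

def select_context_frames(history_poses, current_pose, fov_deg=90.0, top_k=8):
    """Rank historical frames by likely field-of-view overlap with the
    current camera and return the indices of the top-k candidates.

    history_poses: (N, 4) array of [x, y, z, yaw_deg] per past frame.
    current_pose:  (4,) array for the frame about to be generated.
    The pose layout and scoring are assumptions for illustration only.
    """
    positions, yaws = history_poses[:, :3], history_poses[:, 3]

    # Wrap-around angular difference between view directions; cameras
    # pointing the same way are more likely to share visible content.
    d_yaw = np.abs((yaws - current_pose[3] + 180.0) % 360.0 - 180.0)

    # Positional distance; nearby cameras see more of the same scene.
    dist = np.linalg.norm(positions - current_pose[:3], axis=1)

    # Crude overlap proxy: the past view direction must fall inside the
    # current FOV cone, then rank by a weighted distance/angle score.
    score = dist + 0.1 * d_yaw              # lower is better
    score[d_yaw > fov_deg / 2.0] = np.inf   # discard non-overlapping views

    ranked = np.argsort(score)[:top_k]
    return [int(i) for i in ranked if np.isfinite(score[i])]

# Hypothetical usage: 500 past frames, retrieve memory for the next one.
poses = np.column_stack([np.random.rand(500, 3) * 10,
                         np.random.rand(500) * 360])
context_ids = select_context_frames(poses, poses[-1])
```

Because the score depends only on camera geometry, frames seen arbitrarily long ago can still be retrieved, which is what gives this style of method long-horizon memory at a roughly constant per-step cost.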
Context memory rivaling Genie 3, and released earlier: HKU and Kling propose a scene-consistent interactive video world model
机器之心 (Synced) · 2025-08-21 01:03
Core Insights
- The article discusses video generation models that maintain scene consistency over long durations, addressing the critical problem of stable scene memory in interactive long video generation [2][10][17].
- Google DeepMind's Genie 3 is highlighted as a significant advancement in this field, demonstrating strong scene consistency, although its technical details remain undisclosed [2][10].
- The Context as Memory paper, from a research team at the University of Hong Kong and Kuaishou, is presented as a leading academic work closely aligned with Genie 3's principles, emphasizing implicit learning of 3D priors from video data without explicit 3D modeling [2][10][17].

Context as Memory Methodology
- The approach treats historically generated context as memory, enabling scene-consistent long video generation without explicit 3D modeling (a conditioning sketch follows this summary) [10][17].
- A Memory Retrieval mechanism is introduced to make a theoretically unbounded history of frames usable: relevant frames are selected based on camera trajectory and field of view (FOV), significantly improving computational efficiency and reducing training cost [3][10][12].

Experimental Results
- Experimental comparisons show that Context as Memory outperforms existing state-of-the-art methods at maintaining scene memory during long video generation [15][17].
- The model retains static scene memory over long time spans and generalizes well across different scenes [6][15].

Broader Research Context
- The research team has accumulated multiple studies on world models and interactive video generation, proposing a framework of five foundational capabilities: Generation, Control, Memory, Dynamics, and Intelligence [18].
- This framework serves as a guide for future research on foundational world models, with Context as Memory being a focused contribution on the memory capability [18].
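To make the "context as memory" conditioning concrete, here is a minimal sketch of how retrieved context frames could be concatenated with the frames being generated so that attention layers can read the history directly. The (B, T, C, H, W) layout, the boolean mask convention, and the function name are assumptions, not the paper's published interface.

```python
import torch

def build_model_input(context_frames, noisy_frames):
    """Concatenate clean retrieved context frames with the noisy frames a
    video diffusion model is denoising, along the temporal axis.

    Both tensors are assumed to be (B, T, C, H, W); the returned boolean
    mask marks which positions are read-only memory versus targets.
    """
    x = torch.cat([context_frames, noisy_frames], dim=1)

    b, t_ctx = context_frames.shape[:2]
    t_gen = noisy_frames.shape[1]
    is_context = torch.cat([
        torch.ones(b, t_ctx, dtype=torch.bool),   # memory: not denoised
        torch.zeros(b, t_gen, dtype=torch.bool),  # generation targets
    ], dim=1)
    return x, is_context

# Hypothetical usage: 8 retrieved context frames conditioning a 16-frame window.
ctx = torch.randn(1, 8, 3, 64, 64)
gen = torch.randn(1, 16, 3, 64, 64)
x, mask = build_model_input(ctx, gen)
```

The appeal of this design, as the article notes, is that no explicit 3D representation is maintained: scene geometry is carried implicitly by whichever frames the retrieval step deems visible from the current viewpoint.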