Context as Memory! HKU & Kuaishou propose a scene-consistent interactive video world model whose memory rivals Genie 3, and it arrived earlier!
KUAISHOU (HK:01024) · QbitAI (量子位) · 2025-08-21 07:15

Core Viewpoint

The article discusses a new framework called "Context-as-Memory", developed by a research team from the University of Hong Kong and Kuaishou, which significantly improves scene consistency in interactive long video generation by efficiently utilizing historical context frames [8][10][19].

Summary by Sections

Introduction to Context-as-Memory
- The framework addresses scene inconsistency in AI-generated videos with a memory retrieval system that selects relevant historical frames to maintain continuity [10][19].

Types of Memory in Video Generation
- Two types of memory are identified: dynamic memory, covering short-term actions and behaviors, and static memory, covering scene-level and object-level information [12][13].

Key Concepts of Context-as-Memory
- Long video generation requires long-term historical memory to keep the scene consistent over time [15].
- Memory retrieval is crucial: conditioning directly on all historical frames is computationally expensive, so a retrieval module is needed to filter out the useful frames [15].
- Context memory is formed by concatenating the selected context frames with the input, letting the model reference historical information while generating each frame [15][19].

Memory Retrieval Method
- The model uses a camera-trajectory-based search to select context frames whose visible area overlaps significantly with that of the current frame, improving both computational efficiency and scene consistency [20][22]. Both the retrieval rule and the context concatenation are sketched in code after this summary.

Dataset and Experimental Results
- A dataset was built with Unreal Engine 5, containing 100 videos of 7,601 frames each, to evaluate the Context-as-Memory method [23].
- Experimental results show that Context-as-Memory outperforms baseline and state-of-the-art methods in both memory capability and generation quality, demonstrating its effectiveness at maintaining long-video consistency [24][25].

Generalization of the Method
- Generalization was tested using initial frames in a variety of visual styles, confirming strong memory capabilities in open-domain scenarios [26][27].

Research Team and Background
- The research is a collaboration between the University of Hong Kong, Zhejiang University, and Kuaishou, led by PhD student Yu Jiwen under Professor Liu Xihui [28][33].
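
The article describes the retrieval rule only at a high level, so the following is a minimal illustrative sketch of camera-trajectory-based frame selection, not the authors' implementation. It assumes 4x4 camera-to-world pose matrices are available for every frame and approximates field-of-view overlap by the angle between viewing directions; the function names, the `fov_deg` parameter, and the scoring heuristic are all hypothetical.

```python
import numpy as np

def view_overlap(pose_a: np.ndarray, pose_b: np.ndarray, fov_deg: float = 90.0) -> float:
    """Crude overlap proxy between two cameras (hypothetical heuristic):
    full score when the viewing directions align, zero once the angle
    between them exceeds the field of view. A real implementation would
    intersect the camera frustums against scene geometry."""
    # In a camera-to-world matrix, the negated third column is the view direction.
    dir_a, dir_b = -pose_a[:3, 2], -pose_b[:3, 2]
    cos = np.clip(dir_a @ dir_b / (np.linalg.norm(dir_a) * np.linalg.norm(dir_b)), -1.0, 1.0)
    angle = np.degrees(np.arccos(cos))
    return max(0.0, 1.0 - angle / fov_deg)

def retrieve_context_frames(current_pose: np.ndarray,
                            history_poses: list[np.ndarray],
                            k: int = 4) -> list[int]:
    """Pick the k historical frames that most overlap the current view,
    instead of conditioning on the entire history."""
    scores = np.array([view_overlap(current_pose, p) for p in history_poses])
    return np.argsort(scores)[::-1][:k].tolist()
```

The payoff of this kind of filtering is the one the article names: the generator attends over a handful of relevant frames rather than the full history, which keeps cost bounded while still surfacing the frames that actually constrain the current view.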
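
For the "context memory" step, the article says the retrieved frames are concatenated with the input so the model can reference them during generation. Below is a minimal sketch of that conditioning, assuming a diffusion-style backbone that denoises one frame given a token sequence; the tensor shapes and function name are illustrative assumptions, not the paper's interface.

```python
import torch

def build_conditioning_sequence(noisy_frame: torch.Tensor,
                                context_frames: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate retrieved context frames with the frame being generated
    along the sequence (time) dimension, so the backbone's attention layers
    can read historical content directly, with no separate memory module."""
    # noisy_frame: (C, H, W); each context frame: (C, H, W)
    return torch.stack(context_frames + [noisy_frame], dim=0)  # (k+1, C, H, W)
```

Treating context purely as extra input tokens is the design choice the framework's name points at: memory is not a learned external store but whichever past frames are placed in the model's context window.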