Context as memory! HKU & Kuaishou propose a scene-consistent interactive video world model with memory rivaling Genie 3, released even earlier!
量子位· 2025-08-21 07:15
Core Viewpoint
- The article discusses a new framework called "Context-as-Memory", developed by a research team from the University of Hong Kong and Kuaishou, which significantly improves scene consistency in interactive long video generation by efficiently utilizing historical context frames [8][10][19].

Summary by Sections

Introduction to Context-as-Memory
- The framework addresses scene inconsistency in AI-generated videos by using a memory retrieval system that selects relevant historical frames to maintain continuity [10][19].

Types of Memory in Video Generation
- Two types of memory are identified: dynamic memory for short-term actions and behaviors, and static memory for scene-level and object-level information [12][13].

Key Concepts of Context-as-Memory
- Long video generation requires long-term historical memory to maintain scene consistency over time [15].
- Memory retrieval is crucial: directly using all historical frames is computationally expensive, so a memory retrieval module is needed to filter out the useful information [15].
- Context memory is created by concatenating the selected context frames with the input, allowing the model to reference historical information during frame generation [15][19].

Memory Retrieval Method
- The model employs a camera-trajectory-based search to select context frames that overlap significantly with the current frame's visible area, enhancing both computational efficiency and scene consistency (see the sketch after this summary) [20][22].

Dataset and Experimental Results
- A dataset was created using Unreal Engine 5, containing 100 videos with 7,601 frames each, to evaluate the effectiveness of the Context-as-Memory method [23].
- Experimental results show that Context-as-Memory outperforms baseline and state-of-the-art methods in memory capability and generation quality, demonstrating its effectiveness in maintaining long-video consistency [24][25].

Generalization of the Method
- Generalization was tested using initial frames in a variety of visual styles, confirming strong memory capabilities in open-domain scenarios [26][27].

Research Team and Background
- The research was a collaboration between the University of Hong Kong, Zhejiang University, and Kuaishou, led by PhD student Yu Jiwen under the supervision of Professor Liu Xihui [28][33].
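Based only on the description above, the camera-trajectory retrieval step can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the pose format, the FOV-overlap heuristic, and the function names (`view_direction`, `overlap_score`, `retrieve_context_frames`) are all hypothetical.

```python
import numpy as np

def view_direction(yaw: float, pitch: float) -> np.ndarray:
    """Unit view vector for a camera given yaw/pitch in radians."""
    return np.array([
        np.cos(pitch) * np.cos(yaw),
        np.cos(pitch) * np.sin(yaw),
        np.sin(pitch),
    ])

def overlap_score(pose_a, pose_b, fov: float = np.radians(90.0)) -> float:
    """Crude proxy for how much two cameras' visible areas overlap.

    Each pose is (position, yaw, pitch). Cameras pointing more than one
    FOV apart, or positioned far from each other, are assumed to share
    little visible area. The paper computes overlap from camera
    trajectories; its exact formula is not given in this excerpt.
    """
    (pos_a, yaw_a, pitch_a), (pos_b, yaw_b, pitch_b) = pose_a, pose_b
    cos_angle = float(np.clip(
        view_direction(yaw_a, pitch_a) @ view_direction(yaw_b, pitch_b),
        -1.0, 1.0))
    angle = np.arccos(cos_angle)
    angular = max(0.0, 1.0 - angle / fov)   # 1 = same direction, 0 = outside FOV
    spatial = 1.0 / (1.0 + np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b)))
    return angular * spatial

def retrieve_context_frames(history_poses, current_pose, k: int = 8) -> list:
    """Indices of the k historical frames most likely to overlap the
    current frame's visible area; these become the context memory."""
    scores = np.array([overlap_score(p, current_pose) for p in history_poses])
    return sorted(np.argsort(scores)[-k:].tolist())
```

Only the selected frames are then fed to the generator as context, which is what keeps this approach cheaper than attending over the entire history.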
Context memory on par with Genie 3, and released earlier: HKU and Kling propose a scene-consistent interactive video world model
机器之心· 2025-08-21 01:03
For a video generation model to truly become a "world model" that simulates the real physical world, it must be able to generate over long horizons while retaining scene memory. Interactive long video generation, however, has long suffered from a fatal weakness: the lack of stable scene memory. Move the camera away and back again, and the scene in view may have "become a different world".

This problem severely constrains the deployment of video generation technology in downstream applications such as gaming, autonomous driving, and embodied AI. In early August, Google DeepMind's Genie 3 set the AI community ablaze: its ability to maintain strong scene consistency throughout long video generation was hailed as a qualitative leap for world models. Regrettably, however, Genie 3 has not disclosed any technical details.

The Context as Memory paper recently published by a research team from the University of Hong Kong and Kuaishou's Kling may be the academic work whose results come closest to Genie 3, and it was submitted before Genie 3's release. In earlier research, the team had already found that video generation models can implicitly learn 3D priors from video data without the aid of explicit 3D modeling, which coincides with the philosophy behind Genie 3.

Technically, the team innovatively proposes treating historically generated context as "memory" (i.e., Context-as-Memory), using context learning techniques to learn ...
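The excerpt cuts off here, but the mechanism it names, concatenating retrieved context frames with the model input so that generation can reference historical scene content, can be illustrated with a minimal sketch. The tensor layout, the time-axis concatenation, and the function name are assumptions for exposition; the actual model presumably operates on latents inside a video diffusion architecture, whose details this excerpt does not give.

```python
import torch

def build_context_as_memory_input(context_frames: torch.Tensor,
                                  target_frames: torch.Tensor) -> torch.Tensor:
    """Prepend retrieved context (memory) frames to the frames being
    generated, along the time axis, so the model's attention layers
    can read historical scene content directly.

    context_frames: (k, C, H, W) frames selected by memory retrieval
    target_frames:  (t, C, H, W) frames currently being generated
    returns:        (k + t, C, H, W) combined sequence
    """
    return torch.cat([context_frames, target_frames], dim=0)

# Example: 8 retrieved memory frames conditioning 16 new frames.
memory = torch.randn(8, 3, 64, 64)
current = torch.randn(16, 3, 64, 64)
combined = build_context_as_memory_input(memory, current)
assert combined.shape == (24, 3, 64, 64)
```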