Long Video Generation Can Now Look Back: Oxford Proposes "Memory Stabilization" With a 12x Speedup

Core Insights
- VMem introduces a memory indexing system based on 3D geometry, replacing the traditional short-context approach and allowing faster rendering and improved consistency in video generation models [1][5][15]
- The system renders a frame in 4.2 seconds, approximately 12 times faster than conventional methods that use a context window of 21 frames [1][17]

Memory Indexing
- The memory system uses surfels to index previously generated views, recording which frames have seen each surfel [2][10]
- When a new viewpoint is introduced, the system retrieves the most frequently referenced frames as context for rendering; explicit occlusion modeling makes this retrieval more reliable [3][10]

Performance Metrics
- VMem performs best on long-sequence revisits, particularly in the proposed cycle-trajectory evaluation, where it remains stable when returning to the same location [3][15]
- In comparative evaluations, VMem outperforms existing models such as LookOut and GenWarp across metrics including PSNR and LPIPS, indicating better visual consistency [15][16]

Integration and Usability
- The memory module can be integrated into existing image generation frameworks, maintaining performance metrics while reducing the context from K=17 to K=4 [4][10]
- VMem serves as an external memory system that retrieves relevant frames based on geometric visibility, which reduces computational overhead [10][11]

Experimental Results
- The evaluation settings involved generating images along ground-truth camera trajectories, with VMem showing consistent performance over long sequences [12][18]
- Results indicate that VMem maintains long-term consistency and effectively decouples memory capacity from the number of steps taken [16][18]

Limitations and Future Directions
- The current implementation is not real-time because diffusion sampling requires multiple steps; more advanced models and greater computational power could accelerate it [18]
- Generalization to natural landscapes and dynamic objects remains an area for further exploration, as the current fine-tuning focuses primarily on indoor environments [18]
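
The surfel-based indexing and retrieval summarized under Memory Indexing can be illustrated with a minimal Python sketch. This is not the VMem implementation: the class name `SurfelMemory`, the methods `write` and `read`, and the string surfel ids are hypothetical, and the sketch assumes the caller has already determined which surfels are visible from the new viewpoint (for example, through an occlusion-aware rendering of the surfels). It only shows the bookkeeping: each surfel records the frames that have seen it, and retrieval returns the frames referenced most often by the visible surfels, with a default of K=4 to match the reduced context size mentioned above.

```python
from collections import Counter, defaultdict


class SurfelMemory:
    """Toy surfel-indexed view memory (illustrative only, not the VMem code)."""

    def __init__(self):
        # Hypothetical layout: surfel id -> set of frame indices that saw it.
        self.seen_by = defaultdict(set)

    def write(self, frame_idx, surfel_ids):
        """Record that frame `frame_idx` observed every surfel in `surfel_ids`."""
        for sid in surfel_ids:
            self.seen_by[sid].add(frame_idx)

    def read(self, visible_surfel_ids, k=4):
        """Return up to k frame indices referenced most often by visible surfels.

        `visible_surfel_ids` is assumed to come from an occlusion-aware
        visibility test performed elsewhere; this sketch does not model it.
        """
        votes = Counter()
        for sid in visible_surfel_ids:
            # Each visible surfel casts one vote for every frame that saw it.
            votes.update(self.seen_by.get(sid, ()))
        # Most votes first; ties broken by frame index for determinism.
        ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
        return [frame for frame, _ in ranked[:k]]


memory = SurfelMemory()
memory.write(0, ["a", "b", "c"])
memory.write(1, ["b", "c", "d"])
memory.write(2, ["e"])

# New viewpoint from which surfels b, c and d are visible.
print(memory.read(["b", "c", "d"], k=2))  # prints [1, 0]
```

Because the lookup depends only on which surfels are visible, the number of retrieved frames stays at k no matter how many frames have been written, which is one way to read the claim that memory capacity is decoupled from the number of generation steps.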