Fei-Fei Li Releases a New World Model That Runs on a Single GPU
SNWSNW (SH:600628) 36Ke · 2025-10-17 01:45

Core Insights
- RTFM (A Real-Time Frame Model), newly launched by Fei-Fei Li, is designed to run in real time with persistence and 3D consistency, requiring only a single H100 GPU to operate [1][10]
- RTFM is built on three core principles: efficiency, scalability, and persistence. These translate into real-time inference at interactive frame rates, continuous improvement as data and compute grow, and permanent retention of every scene [1][6]

Group 1: Model Capabilities
- RTFM can generate and simulate a persistent, interactive, and physically accurate world, with the potential to transform industries from media to robotics [3][5]
- The model's efficiency allows real-time inference on just one H100 GPU, making it immediately deployable while keeping the virtual world intact during user interactions [1][6]

Group 2: Technical Innovations
- RTFM takes a novel approach: a single neural network is trained to generate 2D images directly from 2D inputs, without requiring explicit 3D representations, which simplifies the modeling process [7][8]
- The model uses an autoregressive diffusion transformer architecture, trained end to end on vast amounts of video data, to predict each subsequent frame from the frames that precede it (a minimal sketch of this loop follows the section) [7][8]

Group 3: Memory and Persistence
- RTFM addresses the challenge of persistence by tagging each frame with a spatial pose, allowing the model to maintain a memory of the world without explicit 3D geometry [9][10]
- The concept of context juggling lets the model generate content in different spatial areas from varying sets of context frames, maintaining long-term memory of large worlds during extended interactions (see the pose-memory sketch below) [10]
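The autoregressive loop described in Group 2 can be pictured with a short sketch. Everything below is an illustrative assumption: FrameModel, generate_next_frame, the token shapes, and the fixed four-step denoiser are hypothetical stand-ins, since the article does not disclose RTFM's actual architecture or code. The sketch only shows the control flow: condition on past frames, denoise a new frame from noise, and append it to the history.

```python
# Hypothetical sketch of autoregressive next-frame prediction with a
# transformer; names and shapes are illustrative, not RTFM's real API.
import torch
import torch.nn as nn

class FrameModel(nn.Module):
    """Toy stand-in for a diffusion transformer over frame tokens."""
    def __init__(self, dim: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(dim, dim)

    def forward(self, context: torch.Tensor, noisy_next: torch.Tensor) -> torch.Tensor:
        # Join past frame tokens with the noisy candidate frame and predict
        # a denoised next frame from the combined sequence.
        seq = torch.cat([context, noisy_next], dim=1)
        hidden = self.backbone(seq)
        return self.out(hidden[:, -noisy_next.shape[1]:])

@torch.no_grad()
def generate_next_frame(model: FrameModel, history: torch.Tensor, steps: int = 4) -> torch.Tensor:
    """One autoregressive step: start from noise and iteratively denoise
    into the next frame, conditioned on the history of previous frames."""
    batch, _, dim = history.shape
    frame = torch.randn(batch, 1, dim)           # pure-noise initialization
    for _ in range(steps):                        # crude fixed-step denoiser
        frame = model(history, frame)
    return frame

# Usage: roll the model forward frame by frame, feeding each output back in.
model = FrameModel()
history = torch.randn(1, 8, 256)                  # 8 past frames as tokens
for _ in range(3):
    nxt = generate_next_frame(model, history)
    history = torch.cat([history, nxt], dim=1)    # grow the context window
```

Note the design point the summary emphasizes: the network only ever maps 2D frame tokens to 2D frame tokens; no explicit 3D scene representation appears anywhere in the loop.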
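The pose-tagged memory and context juggling from Group 3 can be sketched in the same spirit. This is a minimal sketch under stated assumptions: PosedFrame, FrameMemory, and nearest_context are hypothetical names, and the nearest-neighbor rule is one plausible reading of "varying contextual frames". The idea it illustrates is that persistence can come from never discarding posed frames, while retrieval by spatial proximity lets different regions of a large world condition the generator on different context sets.

```python
# Hypothetical pose-tagged frame memory: the "world" is a growing set of 2D
# frames, each stored with the camera pose it was generated from, rather
# than explicit 3D geometry. All names here are illustrative assumptions.
from dataclasses import dataclass, field

import numpy as np

@dataclass
class PosedFrame:
    image: np.ndarray      # H x W x 3 RGB frame
    position: np.ndarray   # 3-vector camera position in world coordinates

@dataclass
class FrameMemory:
    frames: list = field(default_factory=list)

    def add(self, image: np.ndarray, position: np.ndarray) -> None:
        # Persistence: frames are never evicted, so revisited locations can
        # be re-rendered consistently from what was generated before.
        self.frames.append(PosedFrame(image, position))

    def nearest_context(self, query_position: np.ndarray, k: int = 8):
        """Context juggling: pick the k frames whose poses lie closest to
        the query camera, so each spatial region draws on its own context."""
        dists = [float(np.linalg.norm(f.position - query_position))
                 for f in self.frames]
        order = np.argsort(dists)[:k]
        return [self.frames[i] for i in order]

# Usage: store frames as the camera moves, then fetch a local context
# window before generating at a new viewpoint.
memory = FrameMemory()
for step in range(20):
    pos = np.array([float(step), 0.0, 0.0])
    memory.add(np.zeros((64, 64, 3), dtype=np.float32), pos)
context = memory.nearest_context(np.array([5.0, 0.0, 0.0]), k=4)
print([f.position[0] for f in context])   # the frames nearest x = 5
```

Under this reading, effective memory scales with the size of the stored world rather than the model's fixed context length, which is what allows long interactions over large scenes.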