Core Insights
- The article covers the launch of RTFM (Real-Time Frame Model), a new world model from Fei-Fei Li's team, which runs in real time, has persistence, and maintains 3D consistency, all on a single H100 GPU [1][2].

Group 1: Model Features
- RTFM is built around three core principles: efficiency, scalability, and persistence; it performs real-time inference at interactive frame rates on a single H100 GPU [2].
- The model supports continuous user interaction, with all generated scenes permanently stored, creating a persistent 3D world that does not vanish when the viewpoint changes [3].

Group 2: Computational Requirements
- Powerful world models need substantial compute to reconstruct, generate, and simulate persistent, interactive, physically accurate environments, with the potential to transform industries from media to robotics [5].
- Generative world modeling is expected to demand more compute than today's large language models: interactive 4K video at 60 fps requires generating over 100,000 tokens per second [7][8].

Group 3: Design Philosophy
- The team believes that methods which scale elegantly with increasing compute will dominate AI, benefiting from decades of exponentially falling computing costs [9].
- The goal was a highly efficient generative world model that is deployable today on a single H100 GPU and scales as more compute becomes available [10].

Group 4: Learning Renderer
- RTFM takes a novel approach, using a single neural network to generate 2D images from one or more input images without relying on any explicit 3D representation [12].
- The model is an autoregressive diffusion transformer trained on vast amounts of video data, predicting each subsequent frame from the frames that came before [13].
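The token-rate claim in Group 2 can be checked with a back-of-envelope calculation. The patch size below is an assumption for illustration (a 16×16-pixel patch per token, as in typical vision transformers), not a figure from the article:

```python
# Back-of-envelope token throughput for 60 fps 4K video generation.
# PATCH is an assumed value, not from the article.
WIDTH, HEIGHT = 3840, 2160   # 4K UHD resolution
PATCH = 16                   # assumed pixels per token edge
FPS = 60

tokens_per_frame = (WIDTH // PATCH) * (HEIGHT // PATCH)  # 240 * 135 = 32,400
tokens_per_second = tokens_per_frame * FPS               # 1,944,000

print(tokens_per_frame, tokens_per_second)
```

Even allowing for aggressive latent-space compression, the result stays well above the article's floor of 100,000 tokens per second, which is itself far beyond typical LLM serving rates.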
Group 5: Memory and Persistence
- RTFM addresses the challenge of persistence by tagging each frame with a pose in 3D space and generating new frames conditioned on a given pose [18].
- The model's memory is spatially organized, letting it maintain a persistent memory of the world without explicitly predicting the 3D geometry of objects [19].
- A technique called context juggling allows RTFM to maintain long-term memory of large worlds during extended interactions without requiring ever-growing computational resources [20].
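The pose-tagged memory and context-juggling ideas above can be sketched as follows. This is a minimal illustration under assumptions of my own (Euclidean distance between camera positions, a fixed-size context window); every name here is hypothetical and does not reflect RTFM's actual API:

```python
import math

def pose_distance(a, b):
    """Euclidean distance between two camera positions (x, y, z)."""
    return math.dist(a, b)

class FrameMemory:
    """Illustrative pose-tagged frame store, not RTFM's implementation."""

    def __init__(self, context_size=4):
        self.frames = []                # (pose, frame) pairs; persists forever
        self.context_size = context_size

    def store(self, pose, frame):
        self.frames.append((pose, frame))

    def select_context(self, query_pose):
        # "Juggle" the context: only the spatially nearest frames enter the
        # fixed-size window, so per-step compute stays bounded even as the
        # stored world grows without limit.
        ranked = sorted(self.frames,
                        key=lambda pf: pose_distance(pf[0], query_pose))
        return [frame for _, frame in ranked[:self.context_size]]

# A generation step would condition the model (not implemented here) on
# these context frames plus the target pose.
mem = FrameMemory(context_size=2)
mem.store((0, 0, 0), "frame_origin")
mem.store((10, 0, 0), "frame_far")
mem.store((1, 0, 0), "frame_near")
print(mem.select_context((0.5, 0, 0)))  # → ['frame_origin', 'frame_near']
```

The design point this captures is the one in Group 5: memory is unbounded and spatially indexed, while the model's working context per frame is constant-sized.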
Fei-Fei Li releases a new world model that runs on a single GPU!
 量子位·2025-10-17 01:04