Core Insights
- The article covers the launch of RTFM (A Real-Time Frame Model) by Fei-Fei Li's team, a world model that offers real-time operation, persistence, and 3D consistency, and can run on a single H100 GPU [3][5][15]

Group 1: Model Features
- RTFM is highly efficient, requiring only one H100 GPU to run inference at interactive frame rates [5]
- The model is designed to scale with growing data and compute, without relying on explicit 3D representations [5][14]
- Users can interact with RTFM indefinitely; all scenes are permanently retained, so the constructed 3D world does not disappear when the viewpoint changes [6]

Group 2: Computational Demands
- Generative world modeling demands significantly more compute than today's large language models [10]
- Generating a 4K interactive video stream at 60 frames per second requires over 100,000 tokens per second, and sustaining more than an hour of continuous interaction could exceed 100 million tokens of context [11][12]
- The team believes that methods which scale elegantly with growing compute will dominate the AI field, benefiting from the falling cost of computation [14]

Group 3: Learning and Rendering
- RTFM takes a novel approach, training a single neural network to generate 2D images from 2D inputs without constructing explicit 3D representations [17][19]
- The model blurs the line between "reconstruction" and "generation," learning complex effects such as reflections and shadows end-to-end from data [21]
- RTFM employs a spatial memory structure, using frames annotated with poses to maintain persistence and context during interaction [26][27]

Group 4: Availability
- RTFM is now available as a preview for users to try and provide feedback [28]
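The throughput figures in Group 2 can be sanity-checked with a back-of-envelope calculation. The sketch below is purely illustrative: the per-frame token count is an assumption (not a figure from the article), chosen so that the implied rates match the article's ">100,000 tokens per second" and ">100 million tokens per hour" claims.

```python
# Back-of-envelope context-size estimate for generative world modeling.
# TOKENS_PER_FRAME is an assumed value: the article's figures imply roughly
# 100,000+ tokens/s at 60 fps, i.e. on the order of ~1,700 tokens per 4K frame.

FPS = 60                   # interactive frame rate
TOKENS_PER_FRAME = 1_700   # assumed visual tokens for one 4K frame (hypothetical)

tokens_per_second = FPS * TOKENS_PER_FRAME
print(f"tokens/s: {tokens_per_second:,}")   # 102,000 -> matches ">100,000" claim

one_hour_tokens = tokens_per_second * 3600
print(f"tokens for 1 hour: {one_hour_tokens:,}")  # ~367 million -> ">100 million"
```

Under these assumptions, an hour of interaction accumulates hundreds of millions of tokens of history, which is why the article frames scalable memory and compute as the central challenge.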
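The "frames with poses" spatial memory in Group 3 can be pictured as a store of generated frames indexed by camera pose, from which past views near the current viewpoint are retrieved as context. The sketch below is a toy illustration only; the class names, position-only pose, and nearest-pose retrieval rule are assumptions, not RTFM's actual design.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Pose:
    # Minimal camera pose for illustration: position only, no orientation.
    x: float
    y: float
    z: float

    def distance(self, other: "Pose") -> float:
        return math.dist((self.x, self.y, self.z), (other.x, other.y, other.z))

@dataclass
class SpatialMemory:
    """Toy 'frames with poses' memory: each generated frame is stored with the
    pose it was rendered from, so earlier views can be retrieved as context
    when the camera returns to a nearby viewpoint."""
    frames: list = field(default_factory=list)  # (Pose, frame) pairs

    def add(self, pose: Pose, frame) -> None:
        # Nothing is ever evicted, mirroring the article's persistence claim.
        self.frames.append((pose, frame))

    def nearest(self, query: Pose, k: int = 3) -> list:
        # Retrieve the k stored frames whose poses are closest to the query.
        ranked = sorted(self.frames, key=lambda pf: pf[0].distance(query))
        return [frame for _, frame in ranked[:k]]

mem = SpatialMemory()
mem.add(Pose(0, 0, 0), "frame_a")
mem.add(Pose(5, 0, 0), "frame_b")
mem.add(Pose(0.5, 0, 0), "frame_c")
print(mem.nearest(Pose(0, 0, 0), k=2))  # ['frame_a', 'frame_c']
```

Keying context on pose rather than on recency is what lets a frame model re-render a region consistently when the user looks back at it, without ever building an explicit 3D representation.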
Fei-Fei Li has released a world model that runs inference on a single GPU; can autonomous driving applications be far behind?
自动驾驶之心·2025-10-21 00:06