Generative World Models
Fei-Fei Li Releases a New World Model That Runs on a Single GPU
36Ke · 2025-10-17 01:45
Core Insights
- The newly launched RTFM (A Real-Time Frame Model) by Fei-Fei Li is designed to operate in real time with persistence and 3D consistency, requiring only a single H100 GPU for operation [1][10]
- RTFM is built on three core principles: efficiency, scalability, and persistence, allowing real-time inference at interactive frame rates, continuous scaling with more data and compute, and permanent retention of all scenes [1][6]

Group 1: Model Capabilities
- RTFM can generate and simulate a persistent, interactive, and physically accurate world, which has the potential to transform industries from media to robotics [3][5]
- The model's efficiency allows it to perform real-time inference with just one H100 GPU, making it immediately deployable while ensuring the virtual world remains intact during user interactions [1][6]

Group 2: Technical Innovations
- RTFM takes a novel approach by training a single neural network to generate 2D images from 2D inputs without requiring explicit 3D representations, simplifying the modeling process [7][8]
- The model employs an autoregressive diffusion transformer architecture, trained end to end on vast amounts of video data, enabling it to predict subsequent frames from historical frames (see the sketch following this summary) [7][8]

Group 3: Memory and Persistence
- RTFM addresses the challenge of persistence by tagging each frame with a spatial pose, allowing the model to maintain a memory of the world without explicit 3D geometry [9][10]
- The concept of context juggling enables the model to generate content in different spatial areas using varying context frames, maintaining long-term memory of large worlds during extended interactions [10]
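The two technical points above (a learned renderer predicting frames autoregressively, with each frame tagged by a spatial pose) can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch illustration of pose-conditioned autoregressive frame prediction; the class names, tensor shapes, and layer choices are assumptions for exposition, not RTFM's actual architecture.

```python
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Toy stand-in for an autoregressive frame model conditioned on camera poses."""
    def __init__(self, frame_dim: int = 256, pose_dim: int = 6, hidden: int = 512):
        super().__init__()
        self.frame_embed = nn.Linear(frame_dim, hidden)
        self.pose_embed = nn.Linear(pose_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(hidden, frame_dim)

    def forward(self, context_frames, context_poses, target_pose):
        # Each context frame is tagged with its camera pose; the target pose asks
        # "what does the world look like from here?"
        tokens = self.frame_embed(context_frames) + self.pose_embed(context_poses)
        query = self.pose_embed(target_pose).unsqueeze(1)
        h = self.backbone(torch.cat([tokens, query], dim=1))
        return self.head(h[:, -1])  # latent for the next frame at the target pose

# Usage: 8 context frame latents plus their poses -> the next frame's latent.
frames = torch.randn(1, 8, 256)   # flattened frame latents (hypothetical size)
poses = torch.randn(1, 8, 6)      # one 6-DoF pose per context frame
target = torch.randn(1, 6)        # pose of the frame to be generated
next_frame_latent = FramePredictor()(frames, poses, target)  # shape [1, 256]
```

In the real system the prediction would come from a diffusion sampling loop over video data rather than a single forward pass; the sketch only shows how pose tags can stand in for explicit 3D geometry.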
Fei-Fei Li Releases a New World Model That Runs on a Single GPU!
量子位 (QbitAI) · 2025-10-17 01:04
Core Insights
- The article discusses the launch of a new model called RTFM (A Real-Time Frame Model) by Fei-Fei Li, which operates in real time, has persistence, and maintains 3D consistency, all while running on a single H100 GPU [1][2]

Group 1: Model Features
- RTFM is designed around three core principles: efficiency, scalability, and persistence; it can perform real-time inference at interactive frame rates using only one H100 GPU [2]
- The model supports continuous interaction with users, with all scenes stored permanently, creating a persistent 3D world that does not disappear when the viewpoint changes [3]

Group 2: Computational Requirements
- Powerful world models require significant computational resources to reconstruct, generate, and simulate persistent, interactive, and physically accurate environments, which could revolutionize industries from media to robotics [5]
- The demand for compute in generative world modeling is expected to exceed that of current large language models, requiring the generation of more than 100,000 tokens per second for 4K interactive video at 60 frames per second [7][8]

Group 3: Design Philosophy
- The team believes that methods which scale elegantly with increasing computational power will dominate the AI field, benefiting from the exponential decline in computing costs over decades [9]
- The goal was to create a highly efficient generative world model that can be deployed immediately and scale with additional compute, all while being driven by a single H100 GPU [10]

Group 4: Learning Renderer
- RTFM takes a novel approach, using a single neural network to generate 2D images from one or more input images without relying on explicit 3D representations [12]
- The model uses an autoregressive diffusion transformer architecture trained on vast amounts of video data, allowing it to predict subsequent frames based on historical frames [13]

Group 5: Memory and Persistence
- RTFM addresses the challenge of persistence by modeling each frame with a pose in 3D space, allowing new frames to be generated for any requested pose [18]
- The model's memory is spatially organized, enabling it to maintain a persistent memory of the world without explicitly predicting the 3D geometry of objects [19]
- The technique of context juggling allows RTFM to maintain long-term memory of large worlds during extended interactions without requiring ever-growing computational resources (see the sketch following this summary) [20]
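The "context juggling" idea summarized in Group 5 (keep a spatially indexed memory of past frames and condition generation only on the frames nearest to the requested viewpoint, so per-frame compute stays bounded while the world persists) can be sketched as follows. All names and the nearest-neighbor heuristic are illustrative assumptions, not the published method.

```python
from dataclasses import dataclass, field
import math

@dataclass
class FrameMemory:
    poses: list = field(default_factory=list)    # (x, y, z) camera positions of stored frames
    frames: list = field(default_factory=list)   # whatever latent represents each frame

    def add(self, pose, frame):
        # Every generated frame is written back to memory, so nothing is forgotten.
        self.poses.append(pose)
        self.frames.append(frame)

    def nearest_context(self, query_pose, k: int = 8):
        """Return the k stored frames spatially closest to the requested viewpoint."""
        ranked = sorted(range(len(self.poses)),
                        key=lambda i: math.dist(self.poses[i], query_pose))
        return [self.frames[i] for i in ranked[:k]]

# Usage: revisiting a location retrieves the frames generated there earlier.
memory = FrameMemory()
memory.add((0.0, 0.0, 0.0), "frame_0")
memory.add((5.0, 0.0, 0.0), "frame_1")
memory.add((0.2, 0.1, 0.0), "frame_2")
context = memory.nearest_context((0.0, 0.0, 0.1), k=2)  # -> ["frame_0", "frame_2"]
```

In a real system the stored frames would be latents rather than strings and the selection would be tied to the model's context window; the point is only that memory grows with the world while the context fed to the model stays fixed.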
The World's First AI-Native Game Engine Evolves Again: If GTA6 Still Won't Come, We'll Just AI One Ourselves
36Ke · 2025-08-22 09:17
Core Insights
- The article discusses advancements in the AI-driven game engine Mirage 2, which has evolved significantly from its predecessor, Mirage 1, in just over a month [2][4][17]

Group 1: Mirage 2 Features
- Mirage 2 is described as a generative world engine that lets users create, experience, and modify any interactive world, not limited to gaming [2][4]
- It supports uploading images to convert them into interactive game worlds and allows the game environment to be modified in real time through text commands [5][11]
- The engine has improved performance metrics, including faster prompt control, game latency reduced to 200 ms, and the ability to run on a single consumer GPU (see the sketch following this summary) [5][14][13]

Group 2: Comparison with Competitors
- Mirage 2 is positioned to compete with DeepMind's Genie 3, offering more interactive capabilities such as running, jumping, and attacking, with a longer interaction horizon of over 10 minutes [11][13]
- The article highlights that Mirage 2 has significantly improved object proportions and scene understanding compared to Mirage 1, achieving a more realistic representation of characters and vehicles [14][17]

Group 3: Technical Challenges
- Despite the advancements, technical issues remain, such as action-control precision and visual consistency during rapid scene changes [16][17]
- While Mirage 2 has made strides, it still falls short of the consistency demonstrated by Genie 3, indicating areas for further development [16][17]
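As a rough illustration of the real-time interaction figures quoted above (a roughly 200 ms per-frame latency target, with text prompts steering the world mid-play), a schematic interaction loop might look like the following. The generate_next_frame function is a hypothetical placeholder, not Mirage 2's actual API.

```python
import time

LATENCY_BUDGET_S = 0.200  # the ~200 ms per-frame figure cited for Mirage 2

def generate_next_frame(world_state, action, prompt=None):
    # Placeholder for a call into a generative world engine; here it just
    # simulates some work and advances a dummy state.
    time.sleep(0.05)
    return world_state + 1

def interaction_loop(steps: int = 5):
    world_state = 0
    for step in range(steps):
        start = time.perf_counter()
        world_state = generate_next_frame(
            world_state,
            action="move_forward",
            prompt="add rain" if step == 2 else None,  # text command mid-session
        )
        latency = time.perf_counter() - start
        status = "ok" if latency <= LATENCY_BUDGET_S else "over budget"
        print(f"step {step}: {latency * 1000:.0f} ms ({status})")

interaction_loop()
```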
SceneDiffuser++: City-Scale Traffic Simulation with a Generative World Model (CVPR'25)
自动驾驶之心 · 2025-07-21 11:18
Core Viewpoint
- The article discusses the development of SceneDiffuser++, a generative world model that enables city-scale traffic simulation, addressing the unique challenges of trip-level simulation compared to event-level simulation [1][2]

Group 1: Introduction and Background
- The primary goal of traffic simulation is to supplement limited real-world driving data with extensive synthetic simulation mileage to support the testing and validation of autonomous driving systems [1]
- An ideal generative simulation city (CitySim) should seamlessly simulate a complete journey from point A to point B, managing dynamic elements such as vehicles, pedestrians, and traffic lights [1]

Group 2: Technical Integration
- Achieving CitySim requires integrating multiple technologies, including scene generation, agent behavior modeling, occlusion reasoning, dynamic scene generation, and environmental simulation [2]
- SceneDiffuser++ is the first end-to-end generative world model that consolidates these requirements under a single loss function, enabling complete simulation from A to B [2]

Group 3: Core Challenges and Innovations
- Trip-level simulation faces three unique challenges compared to event-level simulation: dynamic agent management, occlusion reasoning, and environmental dynamics [3]
- SceneDiffuser++ introduces innovations such as multi-tensor diffusion, soft-clipping strategies, and unified generative modeling to address these challenges [4][5]

Group 4: Methodology and Model Details
- SceneDiffuser++ represents scenes as scene tensors, allowing the model to handle dynamic changes in heterogeneous elements such as agents and traffic lights simultaneously [7]
- The model uses a diffusion process for training and inference, focusing on effective feature learning through loss masking and soft clipping to stabilize sparse tensor generation (see the sketch following this summary) [8][9]

Group 5: Performance Evaluation
- Experiments on the WOMD-XLMap dataset show that SceneDiffuser++ outperforms previous models on all metrics, achieving lower Jensen-Shannon divergence values for agent generation and removal [12]
- The model maintains agent dynamics and traffic-light realism over a 60-second simulation, in contrast to previous models that exhibited stagnation [15]

Group 6: Conclusion and Significance
- The core contributions of SceneDiffuser++ include introducing the CitySim concept, designing a unified generative framework, and resolving stability issues in dynamic scene generation through sparse tensor learning and soft clipping [19]
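The loss-masking and soft-clipping ideas mentioned in Group 4 can be sketched for a padded, sparse scene tensor as follows. The tensor layout, the sigmoid-based soft clip, and the masked MSE objective below are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

MAX_AGENTS, FEATURES = 64, 8   # fixed-size agent slots; e.g. x, y, heading, speed, ...

def soft_clip(validity_logits, sharpness: float = 5.0):
    # Squash a real-valued validity channel into (0, 1) smoothly instead of
    # hard-thresholding, so "empty" slots (agents not yet spawned or already
    # removed) still carry usable gradient during diffusion training.
    return torch.sigmoid(sharpness * validity_logits)

def masked_diffusion_loss(pred_noise, true_noise, validity):
    # Only valid agent slots contribute to the denoising objective.
    mask = validity.unsqueeze(-1)                                  # [B, MAX_AGENTS, 1]
    per_elem = F.mse_loss(pred_noise, true_noise, reduction="none")
    denom = (mask * per_elem.shape[-1]).sum().clamp(min=1.0)       # soft count of valid elements
    return (per_elem * mask).sum() / denom

# Example: batch of 2 scenes where only the first 10 agent slots are occupied.
pred = torch.randn(2, MAX_AGENTS, FEATURES)
true = torch.randn(2, MAX_AGENTS, FEATURES)
validity_logits = torch.full((2, MAX_AGENTS), -4.0)
validity_logits[:, :10] = 4.0
loss = masked_diffusion_loss(pred, true, soft_clip(validity_logits))
```

The design point being illustrated is that agent insertion and removal become continuous quantities the diffusion model can generate, rather than discrete bookkeeping handled outside the model.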