Generative World Models
World-in-World: Johns Hopkins × Peking University Jointly Propose a Closed-Loop Evaluation Framework for Embodied World Models!
具身智能之心· 2025-10-26 04:02
Core Insights
- The article emphasizes the need to redefine the evaluation of world models in embodied intelligence, focusing on their practical utility rather than just visual quality [2][23]
- The "World-in-World" platform tests world models in real embodied tasks through a closed-loop interaction system, addressing the gap between visual quality and task effectiveness [3][23]

Evaluation Redefinition
- Current evaluation systems prioritize visual clarity and scene rationality, often rewarding models that produce high-quality visuals without assessing their decision-making capabilities in real tasks [2][23]
- The article highlights the importance of aligning actions and predictions in embodied tasks, where the model must accurately predict scene changes based on the agent's movements [2][3]

World-in-World Platform Design
- The platform creates a closed-loop system where the agent, world model, and environment interact in a cycle of observation, decision-making, execution, and re-observation (see the sketch after this summary) [3][6]
- A unified action API standardizes input across different world models, ensuring consistent interpretation of action intentions [6][12]

Task Evaluation
- Four types of real-world embodied tasks are selected for comprehensive testing, each with defined scenarios, objectives, and scoring criteria [10][14]
- The platform incorporates post-training techniques to fine-tune models on task-specific data, improving their adaptability to real-world tasks [12][23]

Experimental Findings
- Experiments with 12 mainstream world models show that fine-tuning on task data is more effective than simply using larger pre-trained models, yielding significant improvements in success rates [17][20]
- Models with high visual quality do not necessarily perform better in practical tasks, underscoring the importance of controllability over visual appeal [18][23]

Recommendations for Future Development
- The article suggests improving controllability, using task data for low-cost gains, and addressing the shortcomings of physical modeling in manipulation tasks [23][22]
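To make the closed-loop design concrete, here is a minimal sketch of one observe-decide-execute-re-observe cycle under a unified action format. The class and method names (`Action`, `agent.act`, `world_model.predict`, `env.step`) are illustrative assumptions for this digest, not the actual World-in-World API.

```python
# Hypothetical sketch of a closed-loop evaluation cycle; all names are
# illustrative stand-ins, not the real World-in-World interface.
from dataclasses import dataclass

@dataclass
class Action:
    """Unified action format shared by every world model under test."""
    dx: float = 0.0      # forward/backward translation
    dy: float = 0.0      # lateral translation
    dtheta: float = 0.0  # rotation in radians

def run_episode(agent, world_model, env, max_steps=100):
    """Observe -> decide -> imagine -> execute -> re-observe loop."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)                       # decide
        predicted = world_model.predict(obs, action)  # imagine the outcome
        obs, done = env.step(action)                  # execute for real
        agent.update(predicted, obs)                  # prediction vs reality
        if done:
            break
    return env.task_score()                           # task-level success metric
```

The point of the loop is that the world model is scored by the task outcome it enables, not by the visual quality of `predicted`.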
With Fei-Fei Li's Single-GPU Inference World Model Released, Can Autonomous Driving Applications Be Far Behind?
自动驾驶之心· 2025-10-21 00:06
Fei-Fei Li's latest world model is all over the feeds.

Editor | 量子位    Source | 李飞飞发布全新世界模型,单GPU就能跑!

Just now, the "AI godmother" herself announced a new model, RTFM (A Real-Time Frame Model). It runs in real time with persistence and 3D consistency, and, most importantly, it runs on a single H100 GPU.

RTFM's design follows three core principles:

- Efficiency: with just one H100 GPU, RTFM completes inference in real time at interactive frame rates.
- Scalability: the architecture keeps scaling as data volume and compute grow. It learns autonomously from massive video data through an end-to-end, general-purpose architecture, building models of the 3D world without relying on explicit 3D representations.
- Persistence: users can interact with RTFM for unlimited durations, and all scenes persist. The persistent 3D worlds the system builds do not vanish when the viewpoint changes.

The details follow.

World models need massive compute. A powerful world model can reconstruct, generate, and simulate persistent, interactive, physically accurate worlds in real time. Such models will fundamentally transform industries from media to robotics. In the past ...
Fei-Fei Li's New "World Model" Arrives: A Single H100 Generates Persistent 3D Worlds in Real Time
36氪· 2025-10-17 09:47
Core Viewpoint
- The article discusses the release of RTFM (Real-Time Frame Model), a highly efficient generative world model developed by World Labs that renders persistent 3D worlds in real time on a single H100 GPU [2][4][12]

Group 1: RTFM Features
- RTFM operates without explicit 3D representations, generating new 2D images from one or more input images [6][7]
- The model learns to simulate complex physical phenomena like 3D geometry, reflections, and shadows solely from observing training video data [9]
- RTFM is designed around three core principles: efficiency, scalability, and persistence [12][14]

Group 2: Efficiency and Scalability
- RTFM can run real-time inference at interactive frame rates with just one H100 GPU, making it a practical solution for current hardware [14][38]
- The model's architecture allows it to scale with increasing data and computational power, avoiding reliance on explicit 3D representations [14][44]
- RTFM is viewed as a "learning renderer," capable of generating new views from 2D images without manual design (see the sketch after this summary) [46][48]

Group 3: Persistence and Memory
- RTFM addresses the challenge of persistence by modeling the pose of each frame in 3D space, giving it a structured memory of the world [60][64]
- The model employs "context juggling" to maintain geometric persistence in large scenes during long interactions [66][67]
- This approach enables RTFM to generate content in different spatial areas while preserving the context of the generated world [66][67]

Group 4: Future Prospects
- RTFM sets a technological roadmap for future world models, emphasizing the potential for real-time deployment on current hardware [69]
- Promising extensions include simulating dynamic worlds and enriching user interaction with generated environments [70]
- The team aims to improve performance with larger models operating under higher inference budgets [71]
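As a rough illustration of the "learning renderer" contract described above: given posed input frames and a target camera pose, a single network returns the new 2D view directly, with no intermediate 3D reconstruction. The function signature and tensor shapes are assumptions for this sketch, not World Labs' actual interface.

```python
# Illustrative "learning renderer" contract: new 2D frames from posed 2D
# inputs plus a target camera pose, with no explicit 3D representation.
import torch

def render_new_view(model: torch.nn.Module,
                    input_frames: torch.Tensor,  # (N, 3, H, W) input images
                    input_poses: torch.Tensor,   # (N, 4, 4) camera-to-world
                    target_pose: torch.Tensor    # (4, 4) pose to render from
                    ) -> torch.Tensor:
    """Return a (3, H, W) image for target_pose, inferred end-to-end."""
    with torch.no_grad():
        return model(input_frames, input_poses, target_pose)
```

With one input view the call behaves like generation; with many views it behaves like reconstruction, which is the blurring of the two tasks the coverage below also notes.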
"AI Godmother" Fei-Fei Li Releases a Real-Time Generative World Model That Runs on a Single H100!
第一财经· 2025-10-17 06:32
Core Viewpoint
- World Labs, founded by AI expert Fei-Fei Li, has introduced a new real-time generative world model called RTFM, which operates efficiently on a single H100 GPU and aims to create a persistent 3D world [3][5][6]

Group 1: Technology and Model Features
- RTFM is designed around three key principles: efficiency, scalability, and persistence, allowing it to run on minimal GPU resources while expanding with increased data and computational power [5]
- The model is a highly efficient autoregressive diffusion Transformer, trained on large-scale video data to learn 3D geometry, reflections, and shadows [6]
- The computational demands for generating interactive 4K video streams are significant, requiring over 100,000 tokens per second, with context exceeding 100 million tokens for sustained interactions (a back-of-envelope check follows this summary) [6]

Group 2: Market Potential and Applications
- Generative world models are expected to revolutionize various industries, particularly content production, targeting game companies and film studios [7]
- World Labs has raised approximately $230 million in funding at a valuation exceeding $1 billion, positioning itself as a new unicorn in the AI sector [7]
- The technology is anticipated to find broad application across art, design, engineering, and robotics, with a focus on enhancing spatial intelligence [8]

Group 3: Future Plans and Challenges
- World Labs plans to focus on building models that deeply understand three-dimensionality, physicality, and concepts of space and time, with future support for AR and robotics [9]
- The team acknowledges the challenge of establishing a profitable business model and aims to push past these boundaries as it progresses [9]
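A quick back-of-envelope check of the quoted compute figures, using only the numbers cited above (100,000 tokens per second for a 4K stream at interactive frame rates):

```python
# Back-of-envelope check of the compute figures quoted in the summary.
tokens_per_second = 100_000        # quoted rate for a 4K interactive stream
frames_per_second = 60             # a common interactive frame rate
tokens_per_frame = tokens_per_second // frames_per_second  # ~1,666/frame

one_hour = 60 * 60                 # seconds
tokens_generated = tokens_per_second * one_hour
print(f"{tokens_per_frame} tokens/frame, {tokens_generated:,} tokens/hour")
# -> 1666 tokens/frame, 360,000,000 tokens/hour: an hour-long session
#    easily pushes the context past the 100M-token figure in the text.
```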
"AI Godmother" Fei-Fei Li Releases a Real-Time Generative World Model That Runs on a Single H100!
第一财经· 2025-10-17 04:40
Core Insights
- The new real-time generative world model RTFM, developed by World Labs, is designed to run on a single H100 GPU, emphasizing efficiency, scalability, and persistence [1][4][5]
- The model is an autoregressive diffusion Transformer trained on large-scale video data, capable of modeling 3D geometry, reflections, and shadows [4][5]
- World Labs aims to create a virtual 3D space where users can control physical variables, with significant implications for various industries including gaming and film production [8][9]

Group 1: Model Features
- RTFM operates under three key principles: efficiency, scalability, and persistence, allowing it to run on minimal GPU resources while expanding with increased data and computational power [4][5]
- The model's computational demands are expected to exceed those of current large language models, requiring over 100,000 tokens per second to generate 4K interactive video streams [4][5]

Group 2: Company Background
- World Labs, founded by Fei-Fei Li in 2024, has raised approximately $230 million at a valuation of over $1 billion, making it a new unicorn in the AI sector [8][9]
- The company has received investments from prominent players in tech and venture capital, including a16z, NVIDIA NVentures, AMD Ventures, and Intel Capital [8]

Group 3: Future Plans
- World Labs plans to focus on building models with a deep understanding of 3D, physical, and spatial concepts, with future support for augmented reality (AR) and robotics [10]
A Real-Time 3D Universe on a Single GPU: Fei-Fei Li's Stunning New World Model Result Debuts
机器之心· 2025-10-17 02:11
Core Insights
- The article discusses the launch of RTFM (Real-Time Frame Model), a generative world model that can run on a single H100 GPU, enabling real-time, consistent 3D world generation from 2D images [2][3][10]

Group 1: RTFM Overview
- RTFM generates new 2D images from one or more 2D inputs without explicitly constructing a 3D representation, functioning as a learning-based renderer [5][17]
- The model is trained on large-scale video data and learns to model 3D geometry, reflections, and shadows through observation [5][17]
- RTFM blurs the line between reconstruction and generation, handling both tasks simultaneously depending on the number of input views [20]

Group 2: Technical Requirements
- Generative world models like RTFM require significant computational power: interactive 4K video streams demand over 100,000 tokens per second [11]
- To maintain consistency in interactions lasting over an hour, the model must process over 100 million tokens of context [12]
- Such demands are economically unfeasible on current infrastructure, but RTFM is designed to be efficient enough to run on existing hardware (a schematic of its autoregressive rollout follows this summary) [13][15]

Group 3: Scalability and Persistence
- RTFM is designed to be scalable, allowing it to benefit from future reductions in computational costs [14]
- The model addresses the challenge of persistence by modeling the spatial pose of each frame, enabling it to remember and reconstruct scenes over time [23][24]
- Context-juggling mechanisms allow RTFM to maintain geometric structure in large scenes while ensuring true world persistence [25]
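The autoregressive rollout described above can be sketched as a loop that conditions each new frame on the growing history of posed frames. The `model.predict_next` interface is a hypothetical stand-in, not RTFM's real API.

```python
# Schematic autoregressive rollout: each new frame is predicted from the
# history of (frame, pose) pairs. `model` is a hypothetical stand-in.
def rollout(model, first_frame, first_pose, camera_path):
    """Generate one frame per requested camera pose, autoregressively."""
    history = [(first_frame, first_pose)]          # grows with every step
    frames = []
    for pose in camera_path:
        frame = model.predict_next(history, pose)  # condition on context
        frames.append(frame)
        history.append((frame, pose))              # feed prediction back in
    return frames
```

The unbounded growth of `history` is exactly why the context figures above matter, and why the persistence mechanisms in Group 3 avoid keeping everything in context at once.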
Fei-Fei Li's New "World Model" Arrives: A Single H100 Generates Persistent 3D Worlds in Real Time
36氪· 2025-10-17 01:48
Core Insights
- The article discusses the launch of RTFM (Real-Time Frame Model), a highly efficient autoregressive diffusion Transformer capable of real-time rendering of persistent, 3D-consistent worlds on a single H100 GPU [1][5][18]

Group 1: Model Features
- RTFM does not create explicit 3D representations but generates new 2D images from one or more input 2D images, functioning as an "AI that has learned to render" [3][15]
- The model learns to simulate complex physical phenomena such as 3D geometry, reflections, and shadows solely from observing training videos [5][24]
- RTFM is designed around three core principles: efficiency, scalability, and persistence [5][31]

Group 2: Efficiency and Scalability
- RTFM can operate in real time at interactive frame rates using only one H100 GPU, making it highly efficient [5][22]
- The model's architecture allows it to scale with increasing data and computational power, learning from large-scale video data without relying on explicit 3D representations [5][23]
- Seen as a "learning renderer," the model converts input frames into neural-network activations that implicitly represent the world [23][29]

Group 3: Persistence and Contextual Memory
- RTFM addresses the challenge of persistence by modeling the pose (position and orientation) of each frame in 3D space, so the world remains consistent even when the user looks away [31][35]
- The model employs "context juggling," retrieving nearby frames from spatial memory to maintain geometric persistence in large scenes during long interactions (one plausible reading of this mechanism is sketched after this summary) [37][38]
- This approach enables RTFM to generate new frames while preserving the context of the world, enhancing the user experience [37][38]

Group 4: Future Prospects
- RTFM sets a technological roadmap for future world models, demonstrating deployability on current hardware while paving the way for larger models with improved performance [38][39]
- The team envisions extending RTFM to simulate dynamic worlds and enrich user interaction with generated environments [38]
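One plausible reading of "context juggling," under the assumption that context frames are retrieved by camera proximity: keep every generated frame with its pose in a spatial memory, and condition each new frame only on the K entries nearest the requested viewpoint. The distance metric and the choice of K are illustrative assumptions, not disclosed details.

```python
# Plausible sketch of pose-based context retrieval ("context juggling"):
# condition each new frame only on the k spatially nearest posed frames.
import numpy as np

def select_context(memory, target_pose, k=8):
    """memory: list of (frame, pose), where pose is a 4x4 camera-to-world
    matrix; return the k entries nearest to target_pose's position."""
    target_pos = target_pose[:3, 3]                    # translation column
    dists = [np.linalg.norm(pose[:3, 3] - target_pos)  # Euclidean distance
             for _, pose in memory]
    nearest = np.argsort(dists)[:k]
    return [memory[i] for i in nearest]
```

This keeps per-frame compute bounded no matter how large the world grows, while the full memory preserves long-term persistence.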
Fei-Fei Li Releases a New World Model That Runs on a Single GPU
36氪· 2025-10-17 01:45
Core Insights
- The newly launched RTFM (A Real-Time Frame Model) from Fei-Fei Li is designed to operate in real time with persistence and 3D consistency, requiring only a single H100 GPU [1][10]
- RTFM is built on three core principles: efficiency, scalability, and persistence, allowing real-time inference at interactive frame rates, continuous expansion with data and computational power, and permanent retention of all scenes [1][6]

Group 1: Model Capabilities
- RTFM can generate and simulate a persistent, interactive, and physically accurate world, with the potential to transform various industries from media to robotics [3][5]
- The model's efficiency allows real-time inference with just one H100 GPU, making it immediately deployable while ensuring the virtual world remains intact during user interactions [1][6]

Group 2: Technical Innovations
- RTFM trains a single neural network to generate 2D images from 2D inputs without requiring explicit 3D representations, simplifying the modeling process [7][8]
- The model employs an autoregressive diffusion Transformer architecture, trained end-to-end on vast video data, enabling it to predict subsequent frames based on historical data (a simplified sampler sketch follows this summary) [7][8]

Group 3: Memory and Persistence
- RTFM addresses the challenge of persistence by modeling each frame with a spatial pose, allowing the model to maintain a memory of the world without explicit 3D geometry [9][10]
- Context juggling enables the model to generate content in different spatial areas using varying context frames, maintaining long-term memory of large worlds during extended interactions [10]
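For readers unfamiliar with diffusion samplers, the inner step behind each frame prediction might look like the following simplified velocity-prediction (flow-matching-style) sampler: start from noise and integrate toward the data, conditioned on context frames and the target pose. The schedule, step count, and `model` signature are assumptions; RTFM's actual sampler is not described in the article.

```python
# Simplified velocity-prediction diffusion sampler for one frame; the
# conditioning interface is a hypothetical stand-in, not RTFM's recipe.
import torch

def sample_next_frame(model, context, target_pose, steps=30,
                      shape=(3, 512, 512)):
    """Euler integration of a velocity-prediction sampler."""
    x = torch.randn(shape)          # start from pure noise at t = 1
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt            # linear time schedule from 1 to 0
        v = model(x, t, context, target_pose)  # predicted velocity field
        x = x - v * dt              # one Euler step toward the data
    return x
```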
Fei-Fei Li Releases a New World Model That Runs on a Single GPU!
量子位· 2025-10-17 01:04
Core Insights
- The article discusses the launch of a new model called RTFM (A Real-Time Frame Model) by Fei-Fei Li, which operates in real time, has persistence, and maintains 3D consistency, all while running on a single H100 GPU [1][2]

Group 1: Model Features
- RTFM is designed with three core principles: efficiency, scalability, and persistence; it can perform real-time inference at interactive frame rates using only one H100 GPU [2]
- The model supports continuous interaction with users, with all scenes permanently stored, creating a persistent 3D world that does not disappear with changes in perspective [3]

Group 2: Computational Requirements
- Powerful world models require significant computational resources to reconstruct, generate, and simulate persistent, interactive, and physically accurate environments, which could revolutionize various industries from media to robotics [5]
- The demand for computational power in generative world modeling is expected to exceed that of current large language models, requiring over 100,000 tokens per second for 4K interactive video at 60 frames per second [7][8]

Group 3: Design Philosophy
- The team believes that methods that scale elegantly with increasing computational power will dominate the AI field, benefiting from the decades-long exponential decline in computing costs [9]
- The goal was a highly efficient generative world model that can be deployed immediately and scales with additional compute, all driven by a single H100 GPU [10]

Group 4: Learning Renderer
- RTFM employs a novel approach, using a single neural network to generate 2D images from one or more input images without relying on explicit 3D representations [12]
- The model utilizes an autoregressive diffusion Transformer architecture trained on vast amounts of video data, allowing it to predict subsequent frames based on historical data [13]

Group 5: Memory and Persistence
- RTFM addresses the challenge of persistence by modeling each frame with a pose in 3D space, allowing new frames to be generated for any requested pose [18]
- The model's memory is spatially organized, enabling it to maintain a persistent memory of the world without explicitly predicting the 3D geometry of objects [19]
- Context juggling allows RTFM to maintain long-term memory of large worlds during extended interactions without prohibitive computational cost [20]
The World's First AI-Native Game Engine Evolves Again: If GTA 6 Still Won't Come, We'll Generate One with AI
36氪· 2025-08-22 09:17
Core Insights
- The article discusses the advancements in the AI-driven game engine Mirage 2, which has evolved significantly from its predecessor, Mirage 1, in just over a month [2][4][17]

Group 1: Mirage 2 Features
- Mirage 2 is described as a generative world engine that allows users to create, experience, and modify any interactive world, not limited to gaming [2][4]
- It supports uploading images and converting them into interactive game worlds, and allows real-time modification of the game environment through text commands [5][11]
- The engine's performance has improved across the board: faster prompt control, game latency reduced to 200 ms, and the ability to run on a single consumer GPU (an illustrative interaction loop under this latency budget follows this summary) [5][14][13]

Group 2: Comparison with Competitors
- Mirage 2 is positioned to compete with DeepMind's Genie 3, offering richer interaction such as running, jumping, and attacking, with an interaction horizon of over 10 minutes [11][13]
- Compared with Mirage 1, Mirage 2 has markedly improved object proportions and scene understanding, achieving a more realistic representation of characters and vehicles [14][17]

Group 3: Technical Challenges
- Despite the advancements, technical issues remain, such as the precision of action control and visual consistency during rapid scene changes [16][17]
- While Mirage 2 has made strides, it still falls short of the consistency demonstrated by Genie 3, indicating areas for further development [16][17]
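To put the cited 200 ms latency in context, here is a hypothetical interaction loop showing where image-to-world conversion, text edits, and per-frame generation would sit inside that budget. The `engine` object and its methods are invented stand-ins, not Mirage 2's real interface.

```python
# Illustrative real-time interaction loop under the ~200 ms latency the
# article cites; `engine` and its methods are hypothetical stand-ins.
import time

LATENCY_BUDGET = 0.200   # seconds from player input to updated frame

def interactive_loop(engine, get_player_input, display):
    world = engine.world_from_image("start_scene.png")  # image -> world
    while True:
        start = time.monotonic()
        cmd = get_player_input()           # keys, or a text edit command
        if cmd.is_text:
            world = engine.apply_text_edit(world, cmd.text)
        frame = engine.step(world, cmd)    # generate the next frame
        display(frame)
        elapsed = time.monotonic() - start
        if elapsed > LATENCY_BUDGET:
            print(f"over budget: {elapsed * 1000:.0f} ms")
```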