TinyWorlds
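A developer grinds for a month to replicate DeepMind's world model: 3 million parameters are enough to play real-time interactive pixel games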
36Kr · 2025-09-28 10:51
Core Insights
- The article covers TinyWorlds, a world model built by the X blogger anandmaj that replicates the core ideas of DeepMind's Genie 3 with only 3 million parameters and generates playable pixel-style environments in real time [1][6].

Group 1: Understanding World Models
- World models are neural networks that simulate the physical world by generating video, and they show emergent capabilities similar to those seen in large language models (LLMs) [2][6].
- DeepMind's Genie 3 demonstrated that training on large-scale video data lets advanced behaviors emerge without action-labeled data [2][6].

Group 2: Dataset Construction
- TinyWorlds' dataset consists of processed YouTube gameplay videos of titles such as Pong, Sonic, Zelda, Pole Position, and Doom. These videos define the environments the model can generate [7].

Group 3: Model Architecture
- The core of TinyWorlds is a space-time transformer that captures video information through spatial attention, temporal attention, and a feedforward network [10].
- An action tokenizer automatically generates frame-to-frame action labels, which makes training on unlabeled data possible [18].

Group 4: Training Dynamics
- The dynamics model is the "brain" of the system: it combines video and action inputs to predict future frames. Its initial performance was limited, and scaling up the model addressed this [21].
- Masked frames and a variance loss, introduced during training, help the model make better use of the action signal [20].

Group 5: Performance and Future Prospects
- Despite having only 3 million parameters, TinyWorlds can generate interactive pixel-style worlds, although the output remains somewhat blurry and incoherent [23][24].
- The author suggests that scaling the model to hundreds of billions of parameters and incorporating diffusion methods could significantly improve the quality of the generated content [24].
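The space-time transformer described under Group 3 can be illustrated with a minimal NumPy sketch. This is not TinyWorlds' actual code: the single attention head, the random weights, the tensor layout `(T, P, D)`, the causal mask on temporal attention, and the helper names `attention`, `st_block`, and `init_params` are all assumptions made for illustration, and layer normalization is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, wq, wk, wv, mask=None):
    """Single-head self-attention over the second-to-last axis of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores) @ v

def st_block(x, params):
    """One space-time block. x has shape (T, P, D):
    T frames, P patches per frame, D embedding dimensions."""
    T = x.shape[0]
    # 1) Spatial attention: patches attend to each other within one frame.
    x = x + attention(x, *params["spatial"])
    # 2) Temporal attention: each patch position attends across frames.
    #    The causal mask (an assumption here) hides future frames.
    xt = np.swapaxes(x, 0, 1)                      # (P, T, D)
    causal = np.tril(np.ones((T, T), dtype=bool))
    xt = xt + attention(xt, *params["temporal"], mask=causal)
    x = np.swapaxes(xt, 0, 1)                      # back to (T, P, D)
    # 3) Feedforward network applied to every token independently.
    w1, w2 = params["ffn"]
    return x + np.maximum(x @ w1, 0.0) @ w2

def init_params(d, rng, hidden=None):
    """Random weights for illustration only (hypothetical helper)."""
    hidden = hidden or 4 * d
    mat = lambda a, b: rng.normal(scale=a ** -0.5, size=(a, b))
    return {
        "spatial": [mat(d, d) for _ in range(3)],
        "temporal": [mat(d, d) for _ in range(3)],
        "ffn": (mat(d, hidden), mat(hidden, d)),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(4, 6, 8))   # 4 frames, 6 patches, 8 dims
    out = st_block(tokens, init_params(8, rng))
    print(out.shape)
```

Splitting attention into a spatial pass and a temporal pass keeps each attention matrix small (P×P or T×T) instead of attending over all T·P tokens at once, which is one plausible reason such a design suits a model of this size.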
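A developer grinds for a month to replicate DeepMind's world model: 3 million parameters are enough to play real-time interactive pixel games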
机器之心 (Jiqizhixin) · 2025-09-28 10:29
Core Insights
- The article covers TinyWorlds, a minimal world model inspired by DeepMind's Genie 3 that generates playable pixel-style environments with only 3 million parameters [1][9][32].

Group 1: Understanding World Models
- World models are neural networks that simulate the physical world by generating video, and they show emergent capabilities when trained on large-scale video data [5][7].
- The obstacle is that training normally requires frame-by-frame action labels, which rules out most unannotated video from the internet [5][6].
- Genie 1 solved this by training an action tokenizer to infer action labels, which opens up vast amounts of unannotated video for training [5][6].

Group 2: Dataset Construction
- TinyWorlds' dataset consists of processed YouTube gameplay videos, which determine the range of environments the model can generate [11][12].

Group 3: Architecture and Tokenization Strategy
- TinyWorlds uses a space-time transformer to handle three-dimensional video data, capturing video information through a three-layer mechanism [15][17].
- The three layers are spatial attention, temporal attention, and a feedforward network that extracts higher-level features [21][22].
- The video tokenizer compresses videos into tokens, while the action tokenizer predicts the action between frames, which allows training on unannotated data [24][26].

Group 4: Training the World Generator
- The dynamics model is the system's "brain": it predicts future frames from video and actions, and its performance improves significantly when the model size is increased [30][32].
- Despite having only 3 million parameters, TinyWorlds can generate interactive pixel-style worlds, though the output remains somewhat blurry and incoherent [32].
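To make the action tokenizer idea from Group 3 concrete, here is a toy sketch of how a discrete action label could be assigned to each pair of consecutive frames. It assumes a vector-quantization step (nearest codebook entry), which the summary does not confirm for TinyWorlds. The function name `infer_actions`, the linear projection `proj`, and the hand-set codebook are hypothetical; in a trained model both the encoder and the codebook would be learned from video rather than fixed by hand.

```python
import numpy as np

def infer_actions(frames, proj, codebook):
    """Assign one discrete action id to each pair of consecutive frames.

    frames:   (T, H, W) grayscale video
    proj:     (H*W, D) stand-in for a learned encoder (assumption)
    codebook: (K, D) one embedding per discrete action (assumption)
    Returns an int array of shape (T-1,) with values in [0, K).
    """
    T = frames.shape[0]
    flat = frames.reshape(T, -1).astype(float)
    deltas = flat[1:] - flat[:-1]            # what changed between frames
    z = deltas @ proj                        # (T-1, D) latent "action" vectors
    # Squared distance from every latent vector to every codebook entry.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)                 # index of the nearest entry

if __name__ == "__main__":
    # A 1x2-pixel "video": first the left pixel turns on, then the right one.
    frames = np.array([[[0, 0]], [[1, 0]], [[1, 1]]])
    proj = np.eye(2)                          # identity encoder for the toy case
    codebook = np.array([[1.0, 0.0],          # action 0: left pixel changes
                         [0.0, 1.0],          # action 1: right pixel changes
                         [0.0, 0.0]])         # action 2: nothing changes
    print(infer_actions(frames, proj, codebook))
```

Because the labels come from the video itself, no human annotation is needed. A dynamics model can then be conditioned on these inferred ids during training and on user-chosen ids at play time.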