Core Insights
- The article discusses the development of TinyWorlds, a world model created by the X blogger anandmaj, which replicates the core ideas of DeepMind's Genie 3 with only 3 million parameters and can generate playable pixel-style environments in real time [1][6].

Group 1: Understanding World Models
- World models are neural networks that simulate the physical world by generating video, showing emergent capabilities similar to those found in large language models (LLMs) [2][6].
- DeepMind's Genie 3 demonstrated that training on large-scale video data allows advanced behaviors to emerge without the need for action-labeled data [2][6].

Group 2: Dataset Construction
- TinyWorlds' dataset consists of processed YouTube gameplay videos, including titles such as Pong, Sonic, Zelda, Pole Position, and Doom, which define the environments the model can generate [7].

Group 3: Model Architecture
- The core of TinyWorlds is a space-time transformer that captures video information through spatial attention, temporal attention, and a feedforward network (see the first sketch after this summary) [10].
- The model employs an action tokenizer to automatically derive frame-to-frame action labels, enabling training on unlabeled data (see the second sketch below) [18].

Group 4: Training Dynamics
- The dynamics model serves as the "brain" of the system, combining video and action inputs to predict future frames; its initial performance limitations were addressed by scaling the model [21].
- Introducing masked frames and a variance loss during training helps the model make better use of the action signal (see the third sketch below) [20].

Group 5: Performance and Future Prospects
- Despite having only 3 million parameters, TinyWorlds can generate interactive pixel-style worlds, although the output remains somewhat blurry and incoherent [23][24].
- The author suggests that scaling the model to hundreds of billions of parameters and incorporating diffusion methods could significantly improve the quality of the generated content [24].
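The space-time transformer mentioned in Group 3 interleaves spatial attention (patches within one frame), temporal attention (the same patch position across frames), and a feedforward network. The PyTorch block below is a minimal sketch of that pattern only; the dimensions, head count, and the absence of a causal mask are simplifying assumptions, not TinyWorlds' actual code.

```python
# Minimal space-time transformer block: spatial attention -> temporal attention -> FFN.
# Shapes and layer names are illustrative assumptions.
import torch
import torch.nn as nn


class SpaceTimeBlock(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, patches, dim) -- token grid for a short clip
        b, t, p, d = x.shape

        # Spatial attention: patches within each frame attend to one another.
        xs = self.norm1(x).reshape(b * t, p, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, p, d)

        # Temporal attention: the same patch position attends across frames
        # (no causal mask here, purely for brevity).
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Position-wise feedforward network.
        return x + self.ffn(self.norm3(x))


if __name__ == "__main__":
    clip = torch.randn(2, 8, 64, 128)  # 2 clips, 8 frames, 64 patches, 128-dim tokens
    print(SpaceTimeBlock()(clip).shape)  # torch.Size([2, 8, 64, 128])
```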
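Group 3 also mentions an action tokenizer that derives discrete frame-to-frame action labels from unlabeled video. A common way to do this is a small encoder plus a vector-quantized codebook; the sketch below assumes that generic VQ design. The encoder layout, codebook size, and straight-through estimator are illustrative choices, not confirmed details of TinyWorlds.

```python
# Hedged sketch of a latent action tokenizer: the change between two consecutive
# frames is mapped to one of a small set of discrete action codes, so action
# "labels" can be derived from unlabeled video.
import torch
import torch.nn as nn


class ActionTokenizer(nn.Module):
    def __init__(self, frame_dim: int = 512, num_actions: int = 8, code_dim: int = 32):
        super().__init__()
        # Encode the pair (previous frame, next frame) into a continuous action embedding.
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, code_dim)
        )
        # Small discrete codebook: each row is one latent "action".
        self.codebook = nn.Embedding(num_actions, code_dim)

    def forward(self, prev_frame: torch.Tensor, next_frame: torch.Tensor):
        z = self.encoder(torch.cat([prev_frame, next_frame], dim=-1))
        # Nearest-neighbour lookup gives the discrete action id for this transition.
        dists = torch.cdist(z, self.codebook.weight)   # (batch, num_actions)
        action_id = dists.argmin(dim=-1)               # (batch,)
        quantized = self.codebook(action_id)
        # Straight-through estimator keeps gradients flowing to the encoder.
        quantized = z + (quantized - z).detach()
        return action_id, quantized


if __name__ == "__main__":
    prev, nxt = torch.randn(4, 512), torch.randn(4, 512)
    ids, codes = ActionTokenizer()(prev, nxt)
    print(ids.shape, codes.shape)  # torch.Size([4]) torch.Size([4, 32])
```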
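Group 4 describes a dynamics model that fuses frame tokens with the inferred action to predict future frames, trained with masked frames and a variance loss so the model cannot ignore the action signal. The sketch below shows one plausible reading of that recipe: random frame masking plus a VICReg-style variance penalty on the action embeddings. Both the masking scheme and the exact form of the variance loss are assumptions.

```python
# Hedged training-side sketch: a dynamics model that conditions frame tokens on an
# action embedding, with random frame masking and a variance penalty as stand-ins
# for the "masked frames" and "variance loss" tricks mentioned in the summary.
import torch
import torch.nn as nn


class DynamicsModel(nn.Module):
    def __init__(self, dim: int = 128, action_dim: int = 32):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward(self, frame_tokens: torch.Tensor, action_emb: torch.Tensor,
                mask_prob: float = 0.3) -> torch.Tensor:
        # frame_tokens: (batch, time, dim); action_emb: (batch, action_dim)
        x = frame_tokens.clone()
        # Randomly mask some past frames so prediction must lean on the action signal.
        mask = torch.rand(x.shape[:2], device=x.device) < mask_prob
        x[mask] = self.mask_token
        # Add the (broadcast) action embedding to every timestep.
        x = x + self.action_proj(action_emb).unsqueeze(1)
        return self.head(self.backbone(x))


def variance_loss(action_emb: torch.Tensor, eps: float = 1e-4, target: float = 1.0):
    # Penalize low variance across the batch so action embeddings do not collapse
    # onto a single code (a VICReg-style regularizer, assumed here for illustration).
    std = torch.sqrt(action_emb.var(dim=0) + eps)
    return torch.relu(target - std).mean()


if __name__ == "__main__":
    frames, actions = torch.randn(2, 8, 128), torch.randn(2, 32)
    pred = DynamicsModel()(frames, actions)  # predicted tokens, same shape as input
    loss = (pred[:, :-1] - frames[:, 1:]).pow(2).mean() + variance_loss(actions)
    print(pred.shape, float(loss))
```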
A coding guru spent a month of intense work replicating DeepMind's world model: with just 3 million parameters, it runs real-time interactive pixel games
36Kr·2025-09-28 10:51