ByteDance Releases a General-Purpose Game Agent! Trained on 500 Billion Tokens, It Crushes GPT-5 Using Only a Mouse and Keyboard!

Core Insights

The article covers Game-TARS, a general-purpose game agent developed by ByteDance's Seed team. It can play games such as Minecraft, Temple Run, and Stardew Valley, and it adapts to unseen 3D web games through zero-shot transfer [3][4][5].

Group 1: Game-TARS Overview
- Game-TARS uses a unified, scalable keyboard-mouse action space for large-scale pre-training across operating systems, the web, and simulated environments, drawing on over 500 billion tokens of labeled multimodal training data [4][20].
- The agent outperforms existing models such as GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet in FPS, open-world, and web games [5][29].

Group 2: Innovation and Design
- The core innovation of Game-TARS is that it operates the way a human does, with keyboard and mouse, rather than by calling predefined functions, which allows more natural interaction with games [6][9].
- Game-TARS focuses on Human Actions: its action instruction set is decoupled from any specific application or operating system, so it aligns directly with how humans interact with a computer [9][10].

Group 3: Training Process
- Unlike traditional game bots, Game-TARS integrates visual perception, strategic reasoning, action execution, and long-term memory into a single vision-language model (VLM) [12][13].
- Training proceeds in two phases, continual pre-training followed by post-training. The large-scale pre-training phase uses over 20,000 hours of game data, approximately 500 billion tokens [15][20][22].

Group 4: Experimental Validation
- Tests in Minecraft validated the unified action space and large-scale continual pre-training, showing improved performance over previous expert models [24][28].
- Game-TARS scales well in both training and inference, with its capabilities improving across a range of tasks and environments [31][34].
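
As a rough illustration of the application-independent keyboard-mouse action space described in Group 2, the sketch below models three primitive actions in Python and serializes them to text. The class names, fields, and output format are hypothetical, chosen only for illustration; the article does not specify the actual action format Game-TARS uses.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical primitive actions: none of these names come from the article.
@dataclass(frozen=True)
class KeyPress:
    key: str  # e.g. "w" or "space"


@dataclass(frozen=True)
class MouseMove:
    dx: int  # relative horizontal movement in pixels
    dy: int  # relative vertical movement in pixels


@dataclass(frozen=True)
class MouseClick:
    button: str  # "left", "right", or "middle"


Action = Union[KeyPress, MouseMove, MouseClick]


def serialize(action: Action) -> str:
    """Render an action as a text string that a model could emit."""
    if isinstance(action, KeyPress):
        return f"keyPress({action.key})"
    if isinstance(action, MouseMove):
        return f"mouseMove({action.dx}, {action.dy})"
    if isinstance(action, MouseClick):
        return f"mouseClick({action.button})"
    raise TypeError(f"unsupported action: {action!r}")


# The same primitives could describe input to any game or desktop app,
# because nothing here refers to a game-specific function.
trajectory: List[Action] = [
    KeyPress("w"),
    MouseMove(30, -5),
    MouseClick("left"),
]

for step in trajectory:
    print(serialize(step))
```

The point of the design is that the vocabulary stays fixed across environments: a new game requires no new action definitions, only different sequences of the same primitives.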