o3-pro Clears "Sokoban": Nostalgic Classic Games Become a New Benchmark for Large Models
QbitAI (量子位) · 2025-06-16 04:49

Core Viewpoint
- Classic nostalgic games such as "Sokoban" and "Tetris" have become benchmarks for evaluating large models, and the o3-pro model has achieved significant breakthroughs in them [1][6].

Group 1: Benchmark Performance
- o3-pro cleared every Sokoban level in the benchmark, whereas the best prior model, o3, reached only the sixth level [2][3].
- In Tetris, the score is the number of blocks placed plus ten times the number of lines cleared; o3-pro's result was double that of o3 [3][13].
- o3-pro is also notably slow, taking several minutes per move [17].

Group 2: Game Evaluation Standards
- The Lmgame benchmark covers several games, each with its own evaluation metric, such as total distance moved in Super Mario Bros and total candies cleared in Candy Crush [6][24].
- Time taken is not part of the evaluation, which relies only on these game-specific metrics [22].
- The benchmark is open source, so anyone can download it and test their own models [23].

Group 3: Development and Support
- The project is developed by the Hao AI Lab at UCSD, which has received support from Google and NVIDIA [28][34].
- The lab has released multiple open-source projects, of which FastVideo has the most stars on GitHub [32].
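
The Tetris scoring rule described under Group 1 (blocks placed plus ten times the lines cleared) can be written out as a minimal sketch. The function name `tetris_score`, its parameter names, and the sample numbers are hypothetical and chosen only for illustration; they are not taken from the Lmgame code base and are not reported results.

```python
def tetris_score(blocks_placed: int, lines_cleared: int) -> int:
    """Illustrative score: blocks placed + 10 * lines cleared.

    Mirrors the rule as summarized in the article; the benchmark's
    actual implementation may differ in details.
    """
    if blocks_placed < 0 or lines_cleared < 0:
        raise ValueError("counts must be non-negative")
    return blocks_placed + 10 * lines_cleared


if __name__ == "__main__":
    # Made-up numbers: 20 blocks placed and 3 lines cleared.
    print(tetris_score(blocks_placed=20, lines_cleared=3))  # 20 + 10 * 3 = 50
```

Under this rule a cleared line is worth as much as ten placed blocks, so a model that survives longer and clears lines steadily gains far more than one that only stacks blocks.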