o3-pro通关“推箱子”，人类怀旧小游戏成了大模型新Benchmark

Core Viewpoint - Classic nostalgic games like Sokoban and Tetris have become benchmarks for evaluating large models, with the o3-pro model recently surpassing previous performance limits in these games [1][2][6]. Group 1: Benchmark Performance - The o3-pro model successfully completed all levels of Sokoban, which previously had a benchmark limit at the sixth level [3][8]. - In comparison to the previous state-of-the-art model (SOTA), o3, the performance of o3-pro has doubled [3][10]. - The scoring system for Tetris involves calculating the number of placed blocks and the number of cleared lines multiplied by ten, until the game ends [13][22]. Group 2: Game Characteristics and Evaluation - The Lmgame benchmark includes several games, such as 2048, Candy Crush, Super Mario Bros, and Phoenix Wright, each with unique evaluation criteria [18][24]. - The evaluation for 2048 is based on the total value of merged blocks, while Candy Crush measures the total candies eliminated in a fixed number of rounds [24]. - The evaluation methods do not consider time as a factor, focusing instead on game-specific performance metrics [22][24]. Group 3: Model Development and Support - The project is developed by the Hao AI Lab at UCSD, which is affiliated with the machine learning systems and NLP labs [28]. - The lab has received funding from Google and NVIDIA, with NVIDIA donating a DGX B200 system to support their research [34]. - The benchmark is open-source, allowing interested parties to download and test their models [23].