Tetris

Two Big "Pitfalls" of Reinforcement Learning Finally Solved by Two ICLR Papers
机器之心· 2025-07-17 09:31
Core Viewpoint
- The article discusses the emergence of real-time reinforcement learning (RL) frameworks that address the limitations of traditional RL algorithms, particularly in dynamic environments where timely decision-making is crucial [1][4].

Group 1: Challenges in Traditional Reinforcement Learning
- Existing RL algorithms often rely on an idealized interaction model in which the environment and the agent take turns pausing for each other, which does not reflect real-world scenarios [3][4].
- Two key difficulties arise in real-time environments: inaction regret, where the agent fails to act at some steps because its reasoning takes too long, and delay regret, where actions computed from stale states take effect too late [7][8].

Group 2: New Frameworks for Real-Time Reinforcement Learning
- Two papers from the Mila laboratory propose a real-time RL framework that tackles reasoning delays and skipped actions, enabling large models to respond instantly in high-frequency, continuous tasks [9].
- The first paper introduces an asynchronous multi-process inference-and-learning framework that lets agents make full use of available compute, thereby eliminating inaction regret [11][15].

Group 3: Performance in Real-Time Applications
- The first paper demonstrates the framework's effectiveness by catching Pokémon in the game "Pokémon Blue" with a 100-million-parameter model, emphasizing the need for rapid adaptation to new scenarios [17].
- The second paper presents an architectural solution that minimizes both inaction and delay in real-time environments, drawing parallels to early CPU architectures and introducing parallel computation mechanisms in neural networks [22][24].

Group 4: Combining Techniques for Enhanced Performance
- Combining staggered asynchronous inference with temporal skip connections reduces both inaction and delay regret, enabling faster decision-making in real-time systems (see the sketch after this summary) [27][36].
- This integration enables the deployment of powerful, responsive agents in fields where response speed is essential, such as robotics, autonomous driving, and financial trading [36][37].
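The papers' own code is not shown in the article, so the following is only a minimal, self-contained sketch of the staggered-asynchronous-inference idea: several identical policy workers are launched one environment step apart, so that even though a single forward pass spans several steps, one worker delivers an action on every step and inaction regret disappears (delay regret remains, which is what the temporal skip connections target). All names here (`Policy`, `run_staggered`, `INFERENCE_STEPS`) are illustrative assumptions, not the papers' API.

```python
from collections import deque
import random

INFERENCE_STEPS = 4  # assumed cost of one policy forward pass, in environment steps
NUM_WORKERS = INFERENCE_STEPS  # staggered so that one worker finishes on every step

class Policy:
    """Stand-in policy: maps an observation to a random action."""
    def act(self, obs):
        return random.choice(["left", "right", "up", "down"])

def run_staggered(num_env_steps=12):
    policy = Policy()
    in_flight = deque()  # (step_started, observation snapshot) for each running worker
    obs = 0              # toy observation: just the current step index

    for t in range(num_env_steps):
        # Launch a new inference on the freshest observation; with one launch
        # per step, exactly NUM_WORKERS computations stay in flight.
        in_flight.append((t, obs))

        # The computation launched INFERENCE_STEPS steps ago finishes now.
        if in_flight and t - in_flight[0][0] >= INFERENCE_STEPS:
            started, old_obs = in_flight.popleft()
            action = policy.act(old_obs)
            delay = t - started  # delay regret: the action reflects an older state
            print(f"step {t}: act {action!r} (from obs at step {started}, delay {delay})")
        else:
            # Only the first warm-up steps have no action ready (inaction).
            print(f"step {t}: no action ready yet (inaction)")

        obs = t + 1  # the environment advances whether or not the agent acted

if __name__ == "__main__":
    run_staggered()
```

After the warm-up phase every step receives an action, but each action is based on an observation from INFERENCE_STEPS steps ago; in the article's terms, staggering removes inaction regret while the remaining delay is what temporal skip connections are meant to shorten.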
o3-pro Clears "Sokoban": Nostalgic Retro Games Become a New Benchmark for Large Models
量子位· 2025-06-16 04:50
Core Viewpoint
- Classic nostalgic games like Sokoban and Tetris have become benchmarks for evaluating large models, with the o3-pro model recently surpassing previous performance limits in these games [1][2][6].

Group 1: Benchmark Performance
- The o3-pro model completed all levels of Sokoban, a benchmark on which earlier models had stalled at the sixth level [3][8].
- Compared with the previous state-of-the-art (SOTA) model, o3, o3-pro's performance has doubled [3][10].
- The Tetris score is the number of placed pieces plus ten times the number of cleared lines, accumulated until the game ends (see the scoring sketch after this summary) [13][22].

Group 2: Game Characteristics and Evaluation
- The Lmgame benchmark includes several games, such as 2048, Candy Crush, Super Mario Bros, and Phoenix Wright, each with its own evaluation criteria [18][24].
- The evaluation for 2048 is based on the total value of merged tiles, while Candy Crush measures the total candies eliminated over a fixed number of rounds [24].
- The evaluation methods do not consider time as a factor, focusing instead on game-specific performance metrics [22][24].

Group 3: Model Development and Support
- The project is developed by the Hao AI Lab at UCSD, which is affiliated with the machine learning systems and NLP labs [28].
- The lab has received funding from Google and NVIDIA, with NVIDIA donating a DGX B200 system to support its research [34].
- The benchmark is open source, allowing interested parties to download it and test their own models [23].
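The article gives the Tetris and 2048 scoring rules only in prose; a minimal sketch of those two metrics, under assumed function names (`tetris_score`, `game_2048_score`) that are not the Lmgame benchmark's actual interface, might look like this:

```python
def tetris_score(pieces_placed: int, lines_cleared: int) -> int:
    """Tetris metric: pieces placed + 10 * lines cleared, counted until game over."""
    return pieces_placed + 10 * lines_cleared

def game_2048_score(merged_tile_values: list[int]) -> int:
    """2048 metric: total value of the tiles produced by merges over the run."""
    return sum(merged_tile_values)

# Examples: 37 placed pieces and 5 cleared lines score 87 in Tetris;
# merges producing tiles 4, 4, 8, and 16 score 32 in 2048.
assert tetris_score(37, 5) == 87
assert game_2048_score([4, 4, 8, 16]) == 32
```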
o3-pro Clears "Sokoban": Nostalgic Retro Games Become a New Benchmark for Large Models
量子位· 2025-06-16 04:49
Core Viewpoint
- Classic nostalgic games like "Sokoban" and "Tetris" have become benchmarks for evaluating large models, with the o3-pro model achieving significant breakthroughs in these games [1][6].

Group 1: Benchmark Performance
- The o3-pro model surpassed previous benchmarks by completing all levels of Sokoban, whereas the best prior model, o3, only reached the sixth level [2][3].
- In Tetris, the score combines the number of placed pieces with ten times the number of cleared lines, and o3-pro's score doubled that of o3 [3][13].
- o3-pro's runs are notably time-consuming, taking several minutes for each move [17].

Group 2: Game Evaluation Standards
- The Lmgame benchmark includes various games, each with its own evaluation metric, such as total distance moved in Super Mario Bros and total candy cleared in Candy Crush (see the metric sketch after this summary) [6][24].
- The evaluation does not consider time as a factor, focusing instead on game-specific performance metrics [22].
- The benchmark is open source, allowing others to download it and test their models [23].

Group 3: Development and Support
- The project is developed by the Hao AI Lab at UCSD, which has received support from Google and NVIDIA [28][34].
- The lab has created multiple open-source projects, with FastVideo being the most starred on GitHub [32].
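As a rough illustration of the per-game, time-agnostic metrics listed above, here is a hedged sketch that registers one scoring function per game; the names (`METRICS`, `mario_distance`, `candy_crush_cleared`) are assumptions made for illustration, not the Lmgame implementation.

```python
from typing import Callable, Dict

def mario_distance(positions_x: list[float]) -> float:
    """Super Mario Bros metric: total forward distance moved (sum of positive displacements)."""
    return sum(max(b - a, 0.0) for a, b in zip(positions_x, positions_x[1:]))

def candy_crush_cleared(cleared_per_round: list[int]) -> int:
    """Candy Crush metric: total candies eliminated over a fixed number of rounds."""
    return sum(cleared_per_round)

# Registry of game -> metric function; note that none of the metrics involve time.
METRICS: Dict[str, Callable] = {
    "super_mario_bros": mario_distance,
    "candy_crush": candy_crush_cleared,
}

assert METRICS["super_mario_bros"]([0.0, 3.5, 7.0]) == 7.0
assert METRICS["candy_crush"]([12, 9, 15]) == 36
```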