DeepMind in Nature Again: An AI Agent Has Built the Strongest RL Algorithm
36Kr · 2025-10-28 00:35

Core Insights
- The central objective of artificial intelligence (AI) research is to design agents that can autonomously predict, act, and achieve goals in complex environments. A long-standing challenge has been enabling such agents to discover efficient reinforcement learning (RL) algorithms on their own [1][2].

Group 1: Discovery Methodology
- Google DeepMind introduced DiscoRL, a method that lets agents autonomously discover RL rules through interaction with many environments. The discovered rules outperformed existing RL algorithms on both familiar and challenging benchmarks [1][2].
- The discovery process interleaves two optimizations: agent optimization and meta-optimization. Each agent updates its own parameters (its policy and predictions) according to the current rule, while a meta-network that defines the rule's update targets is itself optimized so that the agents' cumulative rewards are maximized [3][5]. A minimal sketch of this two-level optimization appears after this summary.

Group 2: Performance Evaluation
- Performance was reported as the interquartile mean (IQM) of normalized scores (see the IQM sketch below). On the Atari benchmark, DiscoRL outperformed established RL algorithms such as MuZero and Dreamer [7][8].
- Disco57, the rule meta-trained on 57 Atari games, reached an IQM of 13.86, surpassing all existing RL rules while showing significant efficiency gains over MuZero [8][14].

Group 3: Generalization and Robustness
- Disco57's generalization was tested on 16 held-out benchmarks, where it outperformed all published methods, including MuZero and PPO. It was also competitive on the Crafter benchmark and placed third in the NetHack NeurIPS 2021 challenge without using any domain-specific knowledge [9][11].
- Disco103, discovered across 103 environments, matched Disco57 on the Atari benchmark and reached human-level performance on Crafter, indicating that more complex and diverse training environments yield stronger, more generalizable RL rules [11][14].

Group 4: Efficiency and Scalability
- Disco57 reached its best performance within roughly 600 million steps per game, making it significantly more efficient than traditional human-designed RL rules, which require far more experimental iteration and time [14][18].
- The performance of the discovered rules improved as the number of training environments grew, suggesting that the strength of discovered RL scales with the available data (environments) and compute [14][17].
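The two-level optimization described in Group 1 can be pictured as a meta-gradient loop: an inner step updates the agent's parameters toward targets produced by a meta-network, and an outer step differentiates through that update so the meta-network is adjusted to increase the updated agent's reward. The JAX sketch below is a deliberately minimal illustration of that structure, not DeepMind's implementation; the names (meta_net, agent_update, cumulative_reward), the toy quadratic inner loss, and the linear reward stand-in are all assumptions.

```python
import jax
import jax.numpy as jnp

def meta_net(meta_params, obs):
    # Hypothetical meta-network: maps observations to update targets for the
    # agent's predictions (here just a linear map with a tanh squash).
    return jnp.tanh(obs @ meta_params)

def agent_update(agent_params, meta_params, obs, lr=0.1):
    # Agent optimization: one gradient step moving the agent's parameters
    # toward the targets prescribed by the current rule (the meta-network).
    targets = meta_net(meta_params, obs)
    def inner_loss(p):
        return jnp.mean((obs @ p - targets) ** 2)
    return agent_params - lr * jax.grad(inner_loss)(agent_params)

def cumulative_reward(agent_params, obs):
    # Stand-in for the return the updated agent would collect in its environment.
    return jnp.sum(obs @ agent_params)

def meta_loss(meta_params, agent_params, obs):
    # Meta-optimization objective: reward obtained *after* applying the rule.
    # Differentiating through agent_update yields the meta-gradient.
    updated = agent_update(agent_params, meta_params, obs)
    return -cumulative_reward(updated, obs)

key = jax.random.PRNGKey(0)
obs = jax.random.normal(key, (32, 4))                   # toy batch of observations
agent_params = jnp.zeros(4)
meta_params = 0.1 * jax.random.normal(jax.random.PRNGKey(1), (4,))

for _ in range(100):
    agent_params = agent_update(agent_params, meta_params, obs)       # inner loop
    meta_grad = jax.grad(meta_loss)(meta_params, agent_params, obs)   # outer loop
    meta_params = meta_params - 0.01 * meta_grad
```

In the actual system the outer objective is aggregated over many agents and environments, but the nesting is the same: the rule is judged only by the rewards the agents earn after following it.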
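The interquartile mean used as the evaluation metric in Group 2 is the mean of the middle 50% of scores: the lowest and highest quarters are discarded, which makes the metric far less sensitive to a few outlier games than a plain average. A small sketch, assuming a flat list of human-normalized per-game scores as input:

```python
import jax.numpy as jnp

def interquartile_mean(scores):
    """Mean of the middle 50% of scores (bottom and top quarters discarded)."""
    scores = jnp.sort(jnp.asarray(scores, dtype=jnp.float32))
    n = scores.shape[0]
    lo, hi = n // 4, n - n // 4          # indices bounding the middle half
    return scores[lo:hi].mean()

# Example with hypothetical normalized scores for eight games: the 40.0 outlier
# barely affects the result.
print(interquartile_mean([0.2, 0.9, 1.3, 2.5, 40.0, 1.1, 0.7, 3.0]))  # ~1.45
```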