Core Viewpoint
- The article discusses ReMA (Reinforced Meta-thinking Agents), a new framework designed to enhance the reasoning capabilities of large language models (LLMs) by introducing a multi-agent system that separates meta-thinking from reasoning tasks, thereby improving adaptability and effectiveness in complex problem-solving [3][4][6][10].

Group 1: Introduction and Background
- Recent explorations in large-model reasoning have introduced various paradigms, including structured search and process reward models, but the mechanisms behind "Aha Moments" in reasoning remain unclear [3].
- The study emphasizes the importance of reasoning patterns and posits that the strength of complex reasoning in large models fundamentally relies on their meta-thinking abilities [3][4].

Group 2: ReMA Framework
- The ReMA framework consists of two hierarchical agents: a meta-thinking agent, which provides strategic supervision and planning, and a reasoning agent, which executes detailed sub-tasks under the meta-thinking agent's guidance [10][11].
- This multi-agent design allows a more structured and efficient exploration of the reasoning process, balancing generalization capability against exploration efficiency [12].

Group 3: Methodology
- The study defines a single-round multi-agent meta-thinking reasoning process (MAMRP) in which the meta-thinking agent analyzes the problem and generates a solution plan, while the reasoning agent completes the task according to those instructions [13][14].
- In multi-round interactions, the meta-thinking agent provides ongoing guidance, enabling planning, reflection, and correction throughout the reasoning process [14][20].

Group 4: Experimental Results
- In single-round experiments, ReMA consistently outperformed baseline methods across various benchmarks, demonstrating superior generalization, particularly on out-of-distribution datasets [27][28].
- The results indicate that ReMA's meta-thinking mechanism significantly enhances performance, with gains of up to 20% on specific benchmarks such as AMC23 [28][29].

Group 5: Challenges and Future Work
- The study acknowledges challenges in multi-round training, including instability and sensitivity to hyperparameters, suggesting that the current training process may not suit stochastic or non-stationary environments [39][40].
- Further exploration is needed to address these issues and improve the robustness of the ReMA framework across diverse training scenarios [39].
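The single- and multi-round processes described in Group 3 can be sketched as below. This is a minimal illustration, not ReMA's actual implementation: `call_model` is a hypothetical stub standing in for real LLM calls, and all function and field names are invented for the sketch.

```python
def call_model(role: str, prompt: str) -> str:
    """Hypothetical stub standing in for an LLM call; a real system would
    query a model conditioned on the role's system prompt."""
    if role == "meta":
        return "plan: decompose the problem, choose a method, verify the result"
    return "answer derived under the given plan"

def mamrp_single_round(question: str) -> dict:
    """Single-round MAMRP: plan once, then solve under that plan."""
    # Meta-thinking agent: analyze the problem and emit a solution plan.
    plan = call_model("meta", f"Analyze and plan: {question}")
    # Reasoning agent: execute the detailed sub-tasks under the plan's guidance.
    answer = call_model("reason", f"Plan: {plan}\nQuestion: {question}")
    return {"plan": plan, "answer": answer}

def mamrp_multi_round(question: str, max_rounds: int = 3) -> dict:
    """Multi-round variant: the meta-thinking agent keeps supervising,
    allowing reflection and correction between rounds."""
    history: list[dict] = []
    answer = ""
    for _ in range(max_rounds):
        # Meta-thinking agent sees the interaction so far and issues new guidance.
        guidance = call_model("meta", f"History: {history}\nGuide: {question}")
        # Reasoning agent attempts the task under the latest guidance.
        answer = call_model("reason", f"Guidance: {guidance}\nQuestion: {question}")
        history.append({"guidance": guidance, "answer": answer})
        # In a real system the meta agent would decide when to terminate;
        # here we stop once the stub's guidance asks for verification.
        if "verify" in guidance:
            break
    return {"rounds": history, "final_answer": answer}
```

The hierarchy matters here: only the meta-thinking agent sees the full interaction history and decides strategy, while the reasoning agent is conditioned solely on the current guidance, which is the separation of roles the article credits for ReMA's generalization gains.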
Meta-Think ≠ Memorizing Templates: Multi-Agent Reinforcement Learning Unlocks Meta-Thinking Generalization in Large Models
机器之心·2025-07-03 03:26