From "Knowing How to Speak" to "Knowing How to Act", the Second Half of LLMs: A Survey of the Agentic Reinforcement Learning Paradigm
机器之心·2025-09-08 10:30

Core Insights
- The article traces the evolution of training paradigms for large language models (LLMs) from preference-based reinforcement fine-tuning (PBRFT) to agentic reinforcement learning (Agentic RL), highlighting the limitations of PBRFT and the advantages of Agentic RL in enabling LLMs to engage in proactive decision-making and long-term planning [2][4][37].

Paradigm Shift
- The transition from PBRFT to Agentic RL is defined formally: PBRFT is viewed as a degenerate single-step Markov decision process (MDP), while Agentic RL operates under a partially observable Markov decision process (POMDP), allowing multi-step interaction with an environment [6][8].
- Key changes include expanding the action space from pure text sequences to both text and actions, and evolving the reward structure from single-step scoring to temporal feedback that optimizes the entire decision trajectory [8][10].

Core Capabilities of Agentic RL
- Six core capabilities are identified as essential for LLMs to function as agents:
1. Planning: setting sub-goals and multi-step action sequences for complex tasks [14].
2. Tool use: learning to autonomously select and combine external tools [15].
3. Memory: maintaining context and accumulating knowledge through various memory-management techniques [17].
4. Self-improvement: enhancing capabilities through self-correction and iterative self-training [18].
5. Reasoning: developing both intuitive and systematic reasoning abilities [19].
6. Perception: actively understanding and processing multi-modal inputs [19].

Applications and Evolution
- Agentic RL is expanding into application domains including search and research optimization, code generation, mathematical reasoning, graphical user interface (GUI) interaction, and multi-agent systems [25][26][27][28].
- The Agentic RL framework is supported by a variety of experimental environments and tools that facilitate research and development [32][33].
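The single-step-versus-trajectory distinction at the heart of the paradigm shift can be sketched in code. The following is a minimal toy illustration, not anything from the survey itself: all names (`pbrft_step`, `ToyEnv`, `agentic_rollout`) and the environment dynamics are hypothetical, chosen only to show how a one-shot scored response differs from a multi-step rollout with partial observations and trajectory-level reward.

```python
# Hypothetical toy sketch (names and environment are illustrative, not from
# the survey) contrasting PBRFT's single-step scoring with an Agentic-RL-style
# multi-step POMDP rollout.

def pbrft_step(prompt, policy, scorer):
    """PBRFT as a degenerate single-step MDP: one text action, one scalar reward."""
    response = policy(prompt)            # the entire "episode" is one text output
    return response, scorer(response)    # single-step preference score

class ToyEnv:
    """Stand-in POMDP: the agent sees only a partial observation each step."""
    def __init__(self, goal=3):
        self.goal, self.count = goal, 0

    def reset(self):
        self.count = 0
        return {"tools_used": 0}         # partial observation, not the full state

    def step(self, action):
        if action == "call_tool":        # action space includes tool calls, not just text
            self.count += 1
        done = self.count >= self.goal
        reward = 1.0 if done else 0.0    # sparse, trajectory-level feedback
        return {"tools_used": self.count}, reward, done

def agentic_rollout(env, policy, horizon=8):
    """Agentic RL optimizes the return of a whole multi-step trajectory."""
    obs = env.reset()
    trajectory, ret = [], 0.0
    for _ in range(horizon):
        action = policy(obs)             # decide from the partial observation
        obs, reward, done = env.step(action)
        trajectory.append((action, reward))
        ret += reward
        if done:
            break
    return trajectory, ret

traj, ret = agentic_rollout(ToyEnv(), policy=lambda obs: "call_tool")
print(len(traj), ret)  # 3 steps, return 1.0
```

The point of the contrast: in `pbrft_step` the "trajectory" is one response with one score, whereas `agentic_rollout` accumulates reward over several interdependent actions, which is what makes credit assignment across the trajectory the central optimization problem.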
Challenges and Future Directions
- Despite its potential, Agentic RL faces challenges such as ensuring reliability and safety, scaling up training, and building environments that accurately reflect real-world complexity [35][39].
- The article emphasizes that overcoming these challenges is what will let LLMs move from merely "speaking" to "doing," evolving into more autonomous and versatile agents [38][39].