Core Insights
- The article introduces a novel reinforcement learning framework, Optimal Tool Calling Policy Optimization (OTC-PO), which trains language models to reach correct answers through optimal tool usage, optimizing both the effectiveness and the efficiency of tool interactions [22].

Group 1: Agent Behavior Patterns
- Agents exhibit two primary behavior patterns, Reasoning and Acting: Reasoning covers internal cognitive processes, while Acting involves interaction with external tools and APIs [4][5].
- The article notes that when models are rewarded only for the correctness of final answers, the boundary between Reasoning and Acting blurs, leading to cognitive offloading and inefficient tool usage [5][16].

Group 2: Reward Function Design
- Different reward functions are proposed to balance Reasoning and Acting, minimizing unnecessary tool calls while preserving the model's reasoning capabilities [6][12].
- The article emphasizes defining the minimal number of tool calls a model needs to answer a question; this minimum varies with the model's capabilities and the problem's complexity [11].

Group 3: Performance Metrics
- The proposed method achieves a 73.1% reduction in tool calls and a 229.4% increase in tool productivity without sacrificing accuracy, and the gains in training time and model performance grow with model size [10][16].
- The OTC-PO framework outperforms existing models in both in-domain and out-of-domain evaluations, indicating robustness and adaptability across scenarios [20].

Group 4: Cognitive Offloading
- The article identifies cognitive offloading as a phenomenon in which larger models rely excessively on external tools, hindering the development of their own reasoning; minimizing tool calls can strengthen the model's cognitive abilities [16][21].
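The reward design described in Group 2 can be sketched as a correctness reward scaled by a tool-use efficiency coefficient. The sketch below is an illustration, not the paper's actual formula: the function name, the cosine-shaped decay, and the `optimal` parameter (the minimal call count, which the article says varies per model and problem) are all assumptions for demonstration.

```python
import math

def tool_reward(correct: bool, calls: int, optimal: int) -> float:
    """Hypothetical OTC-style reward: outcome reward scaled by tool efficiency.

    `optimal` stands in for the (model- and problem-dependent) minimal number
    of tool calls needed to answer the question.
    """
    # Base outcome reward: 1 for a correct final answer, 0 otherwise.
    base = 1.0 if correct else 0.0
    # Efficiency coefficient in (0, 1]: equals 1 at the optimal call count
    # and decays smoothly as excess calls accumulate, so the policy is
    # pushed toward answering with as few tool calls as it truly needs.
    excess = max(calls - optimal, 0)
    coeff = math.cos(math.pi / 2 * excess / (excess + 1)) if excess else 1.0
    return base * coeff
```

Under this shape, a correct answer with no wasted calls earns the full reward, while each extra call shrinks it, which is one way to discourage the cognitive offloading the article describes.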
- A case study illustrates that minimizing tool usage can lead to smarter tool application and improved reasoning capabilities, aligning with the desired behavior of models like OpenAI's o3 [21].
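The "tool productivity" metric cited in Group 3 can be read as correct answers obtained per tool call issued. The helper below illustrates that reading with made-up benchmark numbers; the function name and the example figures are assumptions, not values from the article.

```python
def tool_productivity(num_correct: int, total_tool_calls: int) -> float:
    # One plausible reading of the metric: correct answers per tool call
    # across a benchmark run (higher is better). Guard against a run
    # that issued zero tool calls.
    return num_correct / max(total_tool_calls, 1)

# Hypothetical comparison: a baseline agent that calls tools freely versus
# an agent trained to call them sparingly at similar accuracy.
baseline = tool_productivity(60, 300)   # 0.2 correct answers per call
efficient = tool_productivity(62, 95)   # fewer calls, similar accuracy
```

A large percentage gain on this metric, like the 229.4% the article reports, can come mostly from cutting tool calls rather than from raising accuracy.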
OTC-PO released | Lifting the veil on o3: getting agents to use fewer tools and think more!
Machine Heart (机器之心) · 2025-05-07 04:34