Agentic Reinforcement Learning
NeurIPS 2025 | USTC, CUHK-Shenzhen, and Tongyi Qianwen jointly release CoRT: only 30 samples teach large models efficient reasoning, cutting token consumption by 50%
机器之心· 2025-11-12 13:23
Core Insights
- The article discusses recent large reasoning models (LRMs) such as OpenAI-o1, Qwen3, and DeepSeek-R1, which excel at complex reasoning tasks but struggle with precise mathematical calculation [2]
- A new framework, CoRT (Code-Optimized Reasoning Training), is introduced to make large language models more efficient by teaching them to use code tools effectively for reasoning [3][8]

Group 1: Challenges in Current Models
- Current models face a cognitive conflict between probabilistic reasoning and the deterministic knowledge returned by external tools, leading to inefficiency [4]
- Models often carry out lengthy natural-language reasoning before verifying results with code, delaying calculation and showing unnecessary distrust of code outputs [4]
- High-quality training data for the new "model-tool" collaborative reasoning paradigm is scarce, which poses a significant challenge [4]

Group 2: CoRT Framework Overview
- CoRT aims to reshape how models interact with tools, moving from inefficient verification to efficient computation [8][16]
- The framework follows a three-step approach: data cold start, agent tuning, and an advanced training pipeline [8]

Group 3: Hint-Engineering Strategy
- Hint-Engineering is a novel data synthesis strategy that generates high-quality interaction data by correcting inefficient model behavior at critical decision points [9]
- By strategically injecting guiding hints, the model can be steered to simplify reasoning through code, improving efficiency (a minimal sketch of this hint injection follows the summary) [10][11]

Group 4: Multi-Stage Training Process
- CoRT uses a comprehensive training pipeline of Supervised Fine-Tuning (SFT), Rejection Sampling Fine-Tuning (RFT), and Reinforcement Learning (RL) [13]
- Initial fine-tuning on high-quality samples lets the model learn efficient interaction patterns, while RFT filters out poor trajectories to reinforce good behavior [13]
- The RL stage lets the model autonomously learn optimal tool-usage strategies by interacting with the code interpreter [13]

Group 5: Performance and Efficiency Gains
- CoRT was evaluated on five challenging mathematical reasoning benchmarks and shows significant performance improvements [14]
- The framework delivers a 4% absolute accuracy gain for the DeepSeek-R1-32B model and up to 8% for the 1.5B model, outperforming many data-intensive models [20]
- Token consumption drops by roughly 30% for the 32B model and an impressive 50% for the 1.5B model relative to baseline models [20]

Group 6: Implications and Future Directions
- CoRT offers a new path for addressing the shortcomings of large language models on precise reasoning tasks, pointing toward more powerful and reliable AI systems [16][17]
- Future research will focus on extending the framework to a wider variety of tools and more complex task scenarios [17]
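As a concrete illustration of the Hint-Engineering idea summarized above, the following Python sketch injects a guiding hint at the point where a reasoning trace drifts into lengthy manual arithmetic, so that the continuation can be regenerated with a code call instead. The trigger heuristic, hint text, and function names are assumptions for illustration, not the paper's actual implementation.

```python
import re
from typing import Optional

# Hypothetical hint text; the paper's hints are short guiding prompts injected
# at critical decision points in the reasoning trajectory.
CODE_HINT = ("\nWait, this arithmetic is getting long. "
             "Let me just compute it with code.\n")


def find_injection_point(trace: str) -> Optional[int]:
    """Return the character offset where a long manual calculation starts.

    Heuristic: the first place where several chained arithmetic terms appear
    in a row, which signals the model is grinding through numbers by hand
    instead of delegating to the code interpreter.
    """
    pattern = re.compile(r"(?:\d+(?:\.\d+)?\s*[-+*/^=]\s*){3,}\d")
    match = pattern.search(trace)
    return match.start() if match else None


def inject_hint(trace: str) -> str:
    """Synthesize a hint-engineered trajectory prefix.

    Everything after the injection point is dropped so a teacher model (or the
    model itself) can regenerate the continuation, now conditioned on a hint
    that nudges it toward tool use.
    """
    cut = find_injection_point(trace)
    if cut is None:
        return trace  # no inefficient segment detected; keep the trace as-is
    return trace[:cut] + CODE_HINT


if __name__ == "__main__":
    raw = ("To double-check, I will expand this by hand: "
           "37 * 41 + 13 * 17 = 1517 + 221 = 1738, so the total is ...")
    print(inject_hint(raw))
```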
AEPO: Agentic Entropy-Balanced Policy Optimization for steadier exploration and deeper reasoning!
机器之心· 2025-11-01 04:22
Core Insights
- The article discusses the development of Agentic Entropy-Balanced Policy Optimization (AEPO), a new algorithm aimed at balancing exploration and stability in multi-round reinforcement learning for agents [2][10][11]

Group 1: Algorithm Overview
- AEPO addresses the issues of "high-entropy rollout collapse" and "high-entropy gradient clipping" in existing Agentic RL methods through two core mechanisms: dynamic entropy-balanced rollout sampling and entropy-balanced policy optimization (a sketch of the sampling idea follows this summary) [2][11]
- The algorithm shows significant improvements over seven mainstream reinforcement learning algorithms across 14 cross-domain benchmarks, particularly on deep search tasks [4][12]

Group 2: Performance Metrics
- AEPO achieves a Pass@5 score of 61.5% on deep search tasks, outperforming methods such as ARPO and GRPO by an average of 5.8% [36][37]
- The algorithm maintains training stability while enhancing sampling diversity and reasoning efficiency, providing a new optimization paradigm for scalable reinforcement training of general agents [4][12]

Group 3: Research Motivation
- The motivation behind AEPO is to find a balance in high-entropy environments, where excessive exploration can lead to instability and convergence to local optima [8][10]
- The research highlights the dual nature of high-entropy signals, which are necessary for exploration but can disrupt resource allocation and hinder learning [14][20]

Group 4: Future Directions
- Future research may extend AEPO to multi-modal inputs, complex tool ecosystems, and multi-agent reinforcement learning scenarios to enhance collaborative strategies and performance [41]
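To make the "dynamic entropy-balanced rollout sampling" mechanism more tangible, here is a minimal Python sketch that spreads a fixed branching budget across reasoning steps in proportion to their entropy, while softly capping any single step's share so that one high-entropy step cannot absorb the whole budget (the "high-entropy rollout collapse" failure mode). The softmax weighting, cap, and rounding scheme are illustrative assumptions rather than AEPO's published formulas.

```python
import numpy as np


def allocate_branches(step_entropies: np.ndarray,
                      total_budget: int,
                      temperature: float = 1.0,
                      max_share: float = 0.5) -> np.ndarray:
    """Distribute `total_budget` extra rollout branches over reasoning steps.

    Higher-entropy steps receive more branches (more exploration), but each
    step's share is clipped before renormalization so no single step dominates.
    """
    # Softmax over entropies gives a branching probability for each step.
    logits = step_entropies / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Soft cap: clip each share, then renormalize (an approximate balance;
    # the renormalized share can still slightly exceed max_share).
    probs = np.minimum(probs, max_share)
    probs /= probs.sum()

    # Convert shares to integer branch counts (largest-remainder rounding).
    raw = probs * total_budget
    branches = np.floor(raw).astype(int)
    remainder = total_budget - branches.sum()
    order = np.argsort(raw - branches)[::-1]
    branches[order[:remainder]] += 1
    return branches


if __name__ == "__main__":
    # e.g., policy entropy measured after each tool-call step
    entropies = np.array([0.2, 1.5, 0.4, 2.8])
    print(allocate_branches(entropies, total_budget=8))
```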
Surpassing the baseline with only 1/4 of the budget: Alibaba Amap proposes Tree-GRPO to efficiently tackle the hard problems of agent RL
机器之心· 2025-10-13 23:56
Reinforcement learning for large models has already shown strong results on static tasks such as mathematical reasoning and code generation, but agent tasks that require interacting with the open world still face "two dark clouds": a costly rollout budget (tens of thousands of tokens plus expensive tool calls) and extremely sparse, outcome-only reward signals.

A recent research paper from Alibaba Amap proposes Tree-GRPO, an Agent RL method that replaces independent chain-style sampling with tree search at the level of agent steps. By sharing prefixes and expanding multiple branches at once, the method obtains richer effective trajectories under the same budget; more importantly, the final reward alone can be backed up along the tree structure to recover process-level preference signals, which is equivalent to implicit step-level preference learning (a minimal sketch of this back-up follows below).

Across 11 knowledge-intensive and web-search question-answering datasets, Tree-GRPO achieves higher performance with a smaller budget at multiple model scales, clearly outperforming chain-based RL methods and even surpassing the GRPO baseline with only 1/4 of the budget, offering a new approach to efficient Agentic RL training.

Paper title: Tree Search for LLM Agent Reinforcement Learning

(Figure: tree search with "agent steps" as nodes)
(Figure: differences and advantages of the tree method over the chain method)

Paper link: https://arxiv.org/abs/2509.2 ...
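Below is a minimal Python sketch of the tree-structured credit assignment described above: trajectories sharing a prefix are stored as a tree of agent steps, leaf nodes carry the final outcome reward, values are backed up by averaging over children, and comparing siblings under the same prefix yields step-level preference signals from outcome rewards alone. Class and function names are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepNode:
    """One agent step (e.g., a thought plus tool call) in the rollout tree."""
    text: str
    children: List["StepNode"] = field(default_factory=list)
    reward: Optional[float] = None   # set only on leaves (final outcome reward)
    value: float = 0.0               # backed-up estimate after `backup`


def backup(node: StepNode) -> float:
    """Propagate leaf outcome rewards up the tree by averaging over children."""
    if not node.children:
        node.value = node.reward if node.reward is not None else 0.0
    else:
        node.value = sum(backup(c) for c in node.children) / len(node.children)
    return node.value


def sibling_advantages(node: StepNode) -> List[float]:
    """Step-level signals: each child's value relative to its siblings' mean.

    A positive number marks that branch as preferred under the shared prefix,
    a distinction a purely chain-level outcome reward cannot express.
    """
    if not node.children:
        return []
    mean_v = sum(c.value for c in node.children) / len(node.children)
    return [c.value - mean_v for c in node.children]


if __name__ == "__main__":
    # Two branches expanded from one shared "search" step.
    root = StepNode("search('NeurIPS 2025 CoRT')", children=[
        StepNode("read result A, answer", reward=1.0),
        StepNode("read result B, answer", reward=0.0),
    ])
    backup(root)
    print(sibling_advantages(root))   # -> [0.5, -0.5]
```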