Agentic Reinforcement Learning
NVIDIA's moat breached by AI: ByteDance and Tsinghua's CUDA Agent lets anyone write their own CUDA kernels
机器之心· 2026-03-03 02:55
机器之心 editorial team. Recently, CUDA Agent, a new study from the ByteDance Seed team and Tsinghua University's AIR institute, has caused quite a stir in the AI field. The researchers trained a model that writes fast CUDA kernels: not merely correct kernels, but genuinely optimized ones. On simple and medium-difficulty kernels it outperforms torch.compile by 2x; on complex kernels it beats torch.compile by about 92%; even in the hardest setting it outperforms Claude Opus 4.5 and Gemini 3 Pro by about 40%. The core idea behind CUDA Agent is simple but clever: CUDA performance is determined not by correctness but by the hardware: warps, memory bandwidth, bank conflicts, the kinds of things visible only in a profiler. Instead of rewarding "does it compile", the researchers reward actual GPU speed, using real profiling data, so that reinforcement learning trains directly on performance. Before this, large models such as GPT and Claude could already write "correct" CUDA code, and AI-generated code has seen some real-world adoption, but running at all and running fast are two entirely different things. GPU kernel optimization is foundational to modern deep learning, yet it remains a highly specialized craft that demands deep ...
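The reward design described above (score measured GPU speed, not whether the kernel compiles) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual code: the function name, the reward shaping, and the zero-reward gating are all assumptions.

```python
import math

def kernel_reward(candidate_time_s, baseline_time_s, compiled_ok, correct):
    """Reward a candidate CUDA kernel by measured speedup over a baseline.

    candidate_time_s / baseline_time_s are wall-clock timings (e.g. from a
    profiler); compiled_ok / correct are boolean gates. Hypothetical shaping:
    kernels that fail to compile or produce wrong results get zero reward,
    and the speedup over the torch.compile baseline is passed through log2
    so that successive doublings of speed are rewarded on a comparable scale.
    """
    if not (compiled_ok and correct):
        return 0.0                                 # no credit for "it compiles"
    speedup = baseline_time_s / candidate_time_s   # >1 means faster than baseline
    return max(0.0, math.log2(speedup) + 1.0)      # 1.0 at parity, 2.0 at 2x
```

Gating on correctness while grading on speed keeps the agent from being rewarded for fast-but-wrong kernels, which is the distinction the article draws between "runs" and "runs fast".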
MiniMax's new model rivals top overseas models as domestic LLMs move to "monthly releases"
Nan Fang Du Shi Bao· 2026-02-14 09:28
Core Insights
- MiniMax has launched its latest M2.5 model, which significantly enhances task processing speed and is competitively priced compared to leading overseas models [1][2]

Group 1: Model Performance
- The M2.5 model improves task completion speed by 37%, reducing the average time from 31.3 minutes to 22.8 minutes, comparable to Anthropic's Claude Opus 4.6 model [1]
- In third-party evaluations, M2.5 scored only 0.4 points lower than Opus 4.6 on programming tasks, while its calling price is just 1/8 of Opus 4.6 [1]
- M2.5 demonstrates exceptional performance on long-duration tasks, particularly in programming applications [2]

Group 2: Economic Viability
- The pricing structure for M2.5 allows significant cost savings: four agents can run continuously for a year at a cost of only $10,000 [1]
- At a rate of 100 tokens per second, M2.5 costs $1 for one hour of operation; at 50 tokens per second, it costs $0.3 [1]

Group 3: Industry Context
- MiniMax's rapid development cycle has seen the release of M2, M2.1, and M2.5 within a span of just over 100 days, showcasing a notable rate of performance improvement compared to models like Claude, GPT, and Gemini [3]
- The launch of M2.5 coincides with a competitive wave among domestic AI model companies, with firms such as ByteDance and DeepSeek also releasing new models around the same time [3]
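The cost figures above can be cross-checked with simple arithmetic. The per-hour rates and the "four agents for $10,000 a year" claim are the article's numbers, not an official price list; the helper below just verifies they are mutually consistent.

```python
# Cross-check of the quoted M2.5 running costs (figures from the article).

HOURS_PER_YEAR = 24 * 365                 # 8760 hours in a non-leap year

def annual_cost(n_agents, cost_per_hour):
    """Cost of running n_agents continuously for one year at a flat hourly rate."""
    return n_agents * HOURS_PER_YEAR * cost_per_hour

# At the quoted $0.3/hour (the 50 tokens/s rate), four always-on agents cost:
four_agents = annual_cost(4, 0.3)         # 4 * 8760 * 0.3 ~= 10512 dollars,
                                          # matching the article's ~$10,000/year

# Tokens consumed per hour at each quoted generation speed:
tokens_per_hour_fast = 100 * 3600         # 360,000 tokens for the $1/hour rate
tokens_per_hour_slow = 50 * 3600          # 180,000 tokens for the $0.3/hour rate
```

So the quoted $0.3/hour figure is what makes the "$10,000 per year for four agents" claim add up.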
NeurIPS 2025 | USTC, CUHK-Shenzhen, and Tongyi Qianwen jointly release CoRT: just 30 samples teach large models efficient reasoning, cutting token consumption by 50%
机器之心· 2025-11-12 13:23
Core Insights
- The article discusses advancements in large reasoning models (LRMs) like OpenAI-o1, Qwen3, and DeepSeek-R1, which excel at complex reasoning tasks but struggle with precise mathematical calculation [2]
- A new framework called CoRT (Code-Optimized Reasoning Training) is introduced, aimed at enhancing the efficiency of large language models by teaching them to effectively utilize code tools for reasoning [3][8]

Group 1: Challenges in Current Models
- Current models face cognitive conflicts between probabilistic reasoning and the deterministic knowledge from external tools, leading to inefficiencies [4]
- Models often engage in lengthy natural-language reasoning before verifying results with code, resulting in delayed calculation and unnecessary distrust of code outputs [4]
- High-quality training data for the new "model-tool" collaborative reasoning paradigm is scarce, posing a significant challenge [4]

Group 2: CoRT Framework Overview
- CoRT aims to reshape the interaction between models and tools, transitioning from inefficient verification to efficient computation [8][16]
- The framework employs a three-step approach: data cold start, intelligent agent tuning, and advanced training processes [8]

Group 3: Hint-Engineering Strategy
- Hint-Engineering is introduced as a novel data-synthesis strategy to generate high-quality interaction data, correcting inefficient model behaviors at critical decision points [9]
- By strategically injecting guiding prompts, the model can be directed to simplify reasoning through code, enhancing efficiency [10][11]

Group 4: Multi-Stage Training Process
- CoRT incorporates a comprehensive training pipeline consisting of Supervised Fine-Tuning (SFT), Reject-Sampling Fine-Tuning (RFT), and Reinforcement Learning (RL) [13]
- Initial fine-tuning on high-quality samples lets the model learn efficient interaction patterns, while RFT filters out poor trajectories to reinforce good behaviors [13]
- The RL component enables the model to autonomously learn optimal tool-usage strategies through interaction with the code interpreter [13]

Group 5: Performance and Efficiency Gains
- CoRT was evaluated on five challenging mathematical-reasoning benchmarks, demonstrating significant performance improvements [14]
- The framework achieved a 4% absolute accuracy increase for the DeepSeek-R1-32B model and up to an 8% increase for the 1.5B model, outperforming many data-intensive models [20]
- Token consumption was reduced by approximately 30% for the 32B model and an impressive 50% for the 1.5B model compared to baseline models [20]

Group 6: Implications and Future Directions
- CoRT provides a new pathway for addressing the shortcomings of large language models on precise reasoning tasks, showcasing the potential for more powerful and reliable AI systems [16][17]
- Future research will focus on expanding the framework to incorporate a wider variety of tools and more complex task scenarios [17]
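The reject-sampling step in the SFT-then-RFT-then-RL pipeline can be sketched as below. This is a rough illustration of the general RFT idea (sample many model-tool traces, keep only the good ones); the dict schema, the correctness check, and the token budget are assumptions, not CoRT's actual data format.

```python
def reject_sampling_filter(trajectories, is_correct, max_tokens):
    """Keep only trajectories worth fine-tuning on.

    `trajectories` is a list of dicts with at least 'answer' (the final
    answer produced by the model-plus-code-interpreter trace) and
    'n_tokens' (total tokens spent). Wrong answers are rejected outright;
    correct-but-verbose traces are also rejected, which is one simple way
    to reinforce the efficient interaction patterns the article describes.
    """
    kept = []
    for traj in trajectories:
        if not is_correct(traj["answer"]):
            continue                       # reject wrong final answers
        if traj["n_tokens"] > max_tokens:
            continue                       # reject verbose, inefficient reasoning
        kept.append(traj)
    return kept
```

Filtering on token count as well as correctness is what ties the RFT stage to the framework's efficiency goal (the 30-50% token reductions), rather than to accuracy alone.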
AEPO: Agentic Entropy-Balanced Policy Optimization, steadier exploration and deeper reasoning!
机器之心· 2025-11-01 04:22
Core Insights
- The article discusses the development of Agentic Entropy-Balanced Policy Optimization (AEPO), a new algorithm aimed at balancing exploration and stability in multi-round reinforcement learning for intelligent agents [2][10][11]

Group 1: Algorithm Overview
- AEPO addresses the issues of "high-entropy rollout collapse" and "high-entropy gradient clipping" in existing Agentic RL methods, proposing two core mechanisms: dynamic entropy-balanced rollout sampling and entropy-balanced policy optimization [2][11]
- The algorithm has shown significant improvements over seven mainstream reinforcement learning algorithms across 14 cross-domain benchmarks, particularly on deep search tasks [4][12]

Group 2: Performance Metrics
- AEPO achieved a Pass@5 score of 61.5% on deep search tasks, outperforming methods such as ARPO and GRPO by an average of 5.8% [36][37]
- The algorithm maintains training stability while enhancing sampling diversity and reasoning efficiency, providing a new optimization paradigm for scalable reinforcement training of general intelligent agents [4][12]

Group 3: Research Motivation
- The motivation behind AEPO is to find a balance in high-entropy environments, where excessive exploration can lead to instability and local optima [8][10]
- The research highlights the dual contradiction of high-entropy signals, which are necessary for exploration but can disrupt resource allocation and hinder learning [14][20]

Group 4: Future Directions
- Future research may expand AEPO to multi-modal inputs, complex tool ecosystems, and multi-agent reinforcement learning scenarios to enhance collaborative strategies and performance [41]
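One way to picture "entropy-balanced rollout sampling" is a branching rule that spends its rollout budget only at moderately uncertain steps. This is an interpretive sketch of the idea described above, not AEPO's actual mechanism: the band thresholds and function names are illustrative assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_branch(probs, budget_left, low=0.5, high=2.0):
    """Decide whether to spawn an extra rollout branch at this step.

    Branch only when entropy sits in a moderate band: near-deterministic
    steps are not worth exploring (wasted budget), and extremely
    high-entropy steps are where the article's "rollout collapse" risk
    lives. The [low, high] band is an illustrative choice, not AEPO's.
    """
    h = token_entropy(probs)
    return budget_left > 0 and low <= h <= high
```

The point of the sketch is the shape of the trade-off: exploration concentrates where uncertainty is informative, instead of being spent uniformly or dumped on the most chaotic steps.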
With just 1/4 of the budget, performance surpasses the baseline: Alibaba Gaode proposes Tree-GRPO to efficiently crack the agentic RL problem
机器之心· 2025-10-13 23:56
Core Insights
- The article discusses the Tree-GRPO method proposed by Alibaba Gaode, which enhances reinforcement learning (RL) for agents by transforming independent chain sampling into tree search at the agent-step level, addressing high rollout costs and sparse reward signals [2][4][23]

Group 1: Agentic RL Challenges
- Agentic RL faces two main challenges: high rollout costs involving thousands of tokens and tool calls, and sparse supervision signals that only evaluate the final reward, making it difficult to identify which actions contributed to success or failure [12][19]
- Existing tree-search RL methods typically operate at the token or sentence level, which is not suitable for agents with clear step-level semantic structure [8][19]

Group 2: Tree-GRPO Methodology
- The Tree-GRPO method uses "agent steps" as tree nodes, where each node corresponds to a complete think-action-observe step, allowing more effective trajectory sampling within a given budget [6][8]
- The method initializes multiple independent trajectories and samples nodes to expand the tree, ultimately generating diverse agent trajectories under the same rollout budget [8][19]

Group 3: Performance and Results
- In experiments across 11 knowledge-intensive question-answering tasks, Tree-GRPO consistently outperformed chain-based RL methods, achieving significant gains such as a 69% relative increase in multi-hop QA performance with the smaller Qwen2.5-1.5b model [15][19]
- The method demonstrated a 112% improvement over chain-based methods under extremely limited budget conditions, showcasing its efficiency [19][20]

Group 4: Future Directions
- The Tree-GRPO algorithm presents a new approach to Agentic RL, effectively addressing high rollout budgets and sparse supervision signals and leading to more efficient and stable RL training on multi-turn agent tasks [23][24]
- The team emphasizes the importance of dynamically adjusting the balance between exploration and exploitation in RL training to optimize learning outcomes [24]
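The step-level tree structure described above can be sketched as follows. The field and function names are illustrative assumptions, not Tree-GRPO's implementation; the sketch only shows why sampling on a tree of shared prefixes stretches a fixed rollout budget.

```python
from dataclasses import dataclass, field

@dataclass
class StepNode:
    """One tree node = one complete think-action-observe agent step."""
    think: str
    action: str
    observation: str
    children: list = field(default_factory=list)

def expand(node, sample_step, width):
    """Grow `width` child steps under a node.

    `sample_step(parent)` is assumed to query the policy for one new agent
    step given the trajectory prefix ending at `parent`. Because siblings
    share that prefix, each extra trajectory costs only its new suffix,
    which is how tree sampling yields more (and more diverse) trajectories
    than independent chain rollouts under the same token budget.
    """
    for _ in range(width):
        node.children.append(sample_step(node))
    return node.children
```

A shared prefix also gives a natural credit-assignment signal: comparing the rewards of sibling subtrees isolates the contribution of the step where they diverge, which addresses the sparse-supervision problem noted in Group 1.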