Duke University and Zoom Release LiveMCP-101: GPT-5 Performs Best but Stays Below 60%; Logarithmic Pattern in Closed-Source Models' Token Efficiency Draws Attention
机器之心·2025-08-28 10:40

Core Insights

- The article discusses the introduction of LiveMCP-101, the first evaluation benchmark specifically designed for MCP-enabled agents in real, dynamic environments, consisting of 101 meticulously crafted tasks across domains such as travel planning, sports and entertainment, and software engineering [2][5][27]
- The study reveals that even the most advanced models achieve a success rate of less than 60% on this benchmark, highlighting the significant challenges current LLM agents face in practical deployment [2][5][27]

Research Background and Motivation

- The ability to interact with external tools has become central to AI agents, allowing them to engage dynamically with the real world [5]
- Existing benchmarks are limited in that they focus on single-step tool calls and synthetic environments, failing to capture the complexity and dynamism of real-world scenarios [5]
- Real user queries often carry detailed context and specific constraints, requiring precise reasoning across multiple tool calls [5]

Evaluation Framework

- The benchmark includes 101 high-quality tasks, covering 41 MCP servers and 260 tools, categorized into Easy, Medium, and Hard difficulty levels (a schema sketch appears after this summary) [6]
- A Reference Agent mechanism is established to ensure stable and reproducible results by strictly following predefined execution plans [9]
- A dual scoring mechanism is employed, using LLM-as-judge to assess both the final results and the execution trajectories of the tested agents (see the scoring sketch after this summary) [11]

Key Findings

- Among the 18 evaluated models, GPT-5 leads with a 58.42% overall success rate, and performance declines sharply as task difficulty increases [14]
- The study identifies a strong correlation between execution quality and task success rate, emphasizing the importance of "process correctness" [17]
- Systematic failure modes fall into three main types, with planning and orchestration errors being the most prevalent [20]

Comparison with Existing Work

- LiveMCP-101 offers a more realistic assessment by incorporating a larger tool pool and interference tools, exposing robustness issues under long contexts and selection noise (see the tool-pool sketch after this summary) [23]
- The benchmark's detailed execution plans and scoring methods provide clearer differentiation among model capabilities [24]
- The framework allows precise identification of errors in planning, parameters, or post-processing, guiding engineering optimization [25]
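
To make the benchmark layout concrete, below is a minimal sketch of what a LiveMCP-101 task entry could look like, based only on the facts in the summary (101 tasks, 41 MCP servers, 260 tools, three difficulty levels, predefined execution plans). The field and class names are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass
class ToolCallStep:
    """One step of a predefined execution plan (hypothetical structure)."""
    server: str        # which MCP server exposes the tool, e.g. a travel server
    tool: str          # tool name on that server
    arguments: dict    # ground-truth arguments for this step


@dataclass
class LiveMCPTask:
    """Illustrative task entry: a user query plus the reference plan that the
    Reference Agent replays to produce a stable ground-truth result."""
    task_id: str
    query: str                      # realistic user request with constraints
    difficulty: Difficulty
    mcp_servers: list[str]          # subset of the 41 servers this task touches
    reference_plan: list[ToolCallStep] = field(default_factory=list)
```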
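The dual scoring mechanism described above can be pictured as two LLM-as-judge calls per task: one grading the agent's final answer against the Reference Agent's result, the other grading the execution trajectory against the reference plan. The prompts, the `judge_llm` callable, and the 0-1 scale below are assumptions for illustration; the paper's actual rubric may differ.

```python
import json


def judge(judge_llm, prompt: str) -> float:
    """Ask the judge model for a 0-1 score. `judge_llm` is any callable that
    maps a prompt string to a JSON response string (assumed interface)."""
    response = judge_llm(prompt)
    return float(json.loads(response)["score"])


def dual_score(judge_llm, query: str, agent_answer: str, agent_trajectory: list,
               reference_answer: str, reference_plan: list) -> dict:
    # 1) Result score: does the agent's final answer satisfy the task,
    #    measured against the Reference Agent's ground-truth answer?
    result_prompt = (
        f"Task: {query}\n"
        f"Reference answer: {reference_answer}\n"
        f"Agent answer: {agent_answer}\n"
        'Return JSON {"score": <0-1>} for how well the agent answer '
        "satisfies the task relative to the reference."
    )
    # 2) Trajectory score: did the agent call the right tools with the right
    #    arguments, in a reasonable order, compared with the reference plan?
    trajectory_prompt = (
        f"Reference plan: {json.dumps(reference_plan)}\n"
        f"Agent tool calls: {json.dumps(agent_trajectory)}\n"
        'Return JSON {"score": <0-1>} for execution quality.'
    )
    return {
        "result_score": judge(judge_llm, result_prompt),
        "trajectory_score": judge(judge_llm, trajectory_prompt),
    }
```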
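The "interference tools" point can be illustrated as padding a task's candidate tool list with plausible but irrelevant tools before the agent makes its selection, so that tool choice happens under noise and long context. The helper below is a hypothetical sketch, not the benchmark's code.

```python
import random


def build_tool_pool(required_tools: list[dict], all_tools: list[dict],
                    n_distractors: int, seed: int = 0) -> list[dict]:
    """Mix the tools a task actually needs with distractor tools drawn from
    the full tool pool, so the agent must select the right ones under noise."""
    rng = random.Random(seed)
    required_names = {t["name"] for t in required_tools}
    candidates = [t for t in all_tools if t["name"] not in required_names]
    distractors = rng.sample(candidates, min(n_distractors, len(candidates)))
    pool = required_tools + distractors
    rng.shuffle(pool)   # avoid positional cues about which tools are needed
    return pool
```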