From Components to Systems: How Should Agent Evaluation Be Done?

Core Insights
- The article traces the evolution of Agent Evaluation and argues that new benchmarks are needed as AI systems move from passive large language models (LLMs) to autonomous agents that plan and interact with digital environments [3][4][5].

Group 1: Agent Evaluation Challenges
- Evaluating an agent is harder than evaluating an LLM: traditional LLM evaluation scores static outputs, whereas agent evaluation has to measure end-to-end success rate, reliability, and efficiency in a dynamic environment [5][6].
- Evaluation must account for the agent's interactions with its environment and the emergent properties that arise from them, not just the quality of the text it produces [7][8].

Group 2: Evolution of Evaluation Paradigms
- Agent Evaluation paradigms have evolved alongside the growing complexity and application scope of AI systems, with each generation of benchmarks designed to address the limitations of the previous one [9][10].
- The article compares the successive generations of evaluation, highlighting the shift from static assessment of LLMs to dynamic evaluation of agents operating in real-world scenarios [10][11].

Group 3: Key Evaluation Frameworks
- Newer frameworks such as GAIA, MCP-Universe, MCPMark, and MCP-AgentBench target the challenges specific to agents, focusing on dynamic interaction and the ability to carry out tasks in real time [8][10].
- An agent's core value lies in its autonomy, its planning capability, and its interaction with the environment, so evaluation methods must be able to measure these action-oriented competencies [11].
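
To make the three quantities named under Group 1 concrete, the sketch below aggregates repeated agent runs into an end-to-end success rate, a reliability score, and an efficiency figure. It is illustrative only and is not taken from the article or from any of the benchmarks listed. The `Episode` record, the `summarize` function, and the specific definitions (reliability as the share of tasks solved on every repeated trial, efficiency as mean steps per successful run) are assumptions chosen for the example; real benchmarks define these metrics in their own ways.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    """One end-to-end agent run (hypothetical record, not a benchmark schema)."""
    task_id: str   # which task was attempted
    success: bool  # whether the final environment state met the goal
    steps: int     # number of actions or tool calls the agent took


def summarize(episodes):
    """Aggregate runs into success rate, reliability, and efficiency."""
    if not episodes:
        raise ValueError("need at least one episode")

    # Group repeated trials of the same task together.
    by_task = defaultdict(list)
    for ep in episodes:
        by_task[ep.task_id].append(ep)

    # End-to-end success rate: fraction of all runs that reached the goal.
    success_rate = sum(ep.success for ep in episodes) / len(episodes)

    # Reliability (assumed definition): fraction of tasks solved on
    # every one of their repeated trials.
    reliable_tasks = sum(all(ep.success for ep in eps) for eps in by_task.values())
    reliability = reliable_tasks / len(by_task)

    # Efficiency (assumed definition): mean steps over successful runs only.
    solved_steps = [ep.steps for ep in episodes if ep.success]
    mean_steps = sum(solved_steps) / len(solved_steps) if solved_steps else None

    return {
        "success_rate": success_rate,
        "reliability": reliability,
        "mean_steps_when_solved": mean_steps,
    }


# Made-up data: two tasks, two trials each.
episodes = [
    Episode("book_flight", True, 4),
    Episode("book_flight", True, 6),
    Episode("file_report", True, 8),
    Episode("file_report", False, 20),
]
report = summarize(episodes)
print(report)
```

Separating reliability from the raw success rate matters here because an agent that solves a task on some trials but not others can still post a high average while being unreliable in practice.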