AgentAuditor: Bringing Agent Safety Evaluators to Human-Level Accuracy
机器之心 · 2025-06-27 04:02
Core Insights
- LLM Agents are evolving from mere text generators into autonomous decision-makers capable of executing complex tasks, raising safety concerns about their interactions [1]
- Existing safety benchmarks for LLM Agents lack effective evaluators and struggle to assess the nuanced risks that arise from complex interactions [1]
- AgentAuditor, a framework developed by researchers from multiple universities, aims to raise the safety evaluation of LLM Agents to the level of human experts [2]

Evaluation Challenges
- Traditional LLM safety assessments excel at judging generated content but fail to capture the complexities of agent interactions and decision-making processes [1]
- Current evaluation methods, whether rule-based or model-based, struggle to identify subtle risks and to interpret ambiguous rules accurately [1]

AgentAuditor Framework
- AgentAuditor combines structured memory with retrieval-augmented generation (RAG) to improve LLM evaluators' ability to learn from and understand complex interaction records [4]
- The framework operates in three stages:
  1. Feature Memory Construction transforms raw interaction records into a structured database containing deep semantic information [4]
  2. Reasoning Memory Construction selects representative cases and generates high-quality reasoning chains that guide subsequent evaluations [5]
  3. Memory-Augmented Reasoning dynamically retrieves relevant reasoning experiences to help LLM evaluators make precise judgments [6]

ASSEBench Dataset
- ASSEBench is a newly created benchmark designed to validate AgentAuditor's capabilities, comprising 2,293 meticulously annotated real agent interaction records [9]
- The benchmark covers 15 risk types, 528 interaction environments, and 29 application scenarios, ensuring comprehensive evaluation [9]
- It uses a human-machine collaborative annotation process with both strict and lenient judgment standards for nuanced risk assessment [9]

Experimental Results
- Extensive experiments show that AgentAuditor significantly improves LLM evaluators' performance across multiple datasets, reaching human-level accuracy [10][11]
- For instance, the Gemini-2-Flash-Thinking model's F1 score rose by up to 48.2% on ASSEBench-Safety, approaching human-level performance [12]
- AgentAuditor adapts its reasoning strategy to different evaluation standards, effectively narrowing the performance gaps among models [12]

Conclusion
- AgentAuditor and ASSEBench provide robust evaluation tools and a research foundation for building more trustworthy LLM Agents [17]
- This advance not only propels the development of LLM evaluators but also guides the construction of safer, more reliable agent defense systems [17]
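The retrieval step at the heart of the framework can be pictured as a nearest-neighbor lookup over the reasoning memory: embed the new interaction record, find the most similar annotated cases, and hand their reasoning chains to the LLM evaluator as guidance. The sketch below is illustrative only, not the paper's implementation: it uses a toy bag-of-words embedding with cosine similarity, and the `record`/`chain` field names and example entries are assumptions.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a neural encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical reasoning memory: each entry pairs an annotated interaction
# record with the reasoning chain that justified its safety verdict.
reasoning_memory = [
    {"record": "agent deletes user files without confirmation",
     "chain": "irreversible action taken without consent -> unsafe"},
    {"record": "agent asks user before sending email",
     "chain": "agent sought confirmation before external action -> safe"},
]

def retrieve(query: str, memory: list, k: int = 1) -> list:
    """Return the k memory entries most similar to the new record."""
    q = embed(query)
    ranked = sorted(memory,
                    key=lambda e: cosine(q, embed(e["record"])),
                    reverse=True)
    return ranked[:k]

hits = retrieve("agent removes files with no user confirmation", reasoning_memory)
# The retrieved chain(s) would then be prepended to the evaluator's prompt.
print(hits[0]["chain"])
```

In the full framework the retrieved reasoning chains condition the LLM evaluator's judgment, which is what lets a general-purpose model apply case-specific, expert-like reasoning to ambiguous interactions.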