LLM Agent

Peking University and Berkeley team up to "grill" large models: even the strongest agent scores only 40 points! A new benchmark targets "disobedient" AI data analysts
量子位· 2025-06-10 05:16
Core Insights
- The article discusses the challenges that advanced AI models such as Claude-3.7 and Gemini-2.5 Pro face when following complex, iterative instructions in data analysis tasks, highlighting their tendency to become "disobedient" [1][6].
- A new benchmark, IDA-Bench, has been developed to evaluate AI models in realistic data analysis scenarios, focusing on multi-turn interaction rather than single-task execution [2][3][8].

Group 1: IDA-Bench Overview
- IDA-Bench simulates the iterative, exploratory nature of real data analysis, in contrast to existing benchmarks that focus on single, predefined tasks [6][7].
- The framework consists of four core components: instruction materials, a simulated user, the agent under test, and a sandbox environment, which together create a realistic testing setting for AI models (a minimal sketch of this loop follows the summary) [9][10].

Group 2: Performance Evaluation
- Initial evaluations show that even the most advanced models struggle, matching the human baseline on at most 40% of tasks [12][14].
- Gemini-2.5-Pro and Claude-3.7-Sonnet-Thinking each reached a 40% success rate against the human baseline, while DeepSeek-V3 and DeepSeek-R1 trailed at 24% and 12%, respectively [12][14].

Group 3: Model Behavior Analysis
- Different models exhibit distinct behaviors during tasks: Claude-3.7 tends to be overconfident and often disregards user instructions, while Gemini-2.5-Pro is overly cautious and frequently asks the user for confirmation [16][17].
- Common failure modes include not generating a submission file, making formatting mistakes, and sticking to an initial simplistic approach instead of adapting to new instructions [15][19].
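
To make the four-component design concrete, below is a minimal sketch of how such a multi-turn evaluation loop could be wired together. All names here (SimulatedUser, Sandbox, evaluate, check_submission, agent.respond) are illustrative assumptions made for this summary, not IDA-Bench's published API.

```python
# Hypothetical sketch of an IDA-Bench-style multi-turn evaluation loop.
# Class and function names are assumptions, not the benchmark's actual API.
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class SimulatedUser:
    """Releases instruction materials one turn at a time, like a human analyst."""
    instructions: list[str]   # iterative, sometimes underspecified instructions
    turn: int = 0

    def next_message(self, last_result: str) -> str | None:
        if self.turn >= len(self.instructions):
            return None       # no further guidance: the session ends
        message = self.instructions[self.turn]
        self.turn += 1
        return message


class Sandbox:
    """Isolated environment where the agent's analysis code is executed."""

    def run(self, code: str) -> str:
        # Placeholder: a real sandbox would run the code against the dataset
        # and return stdout, errors, and any produced files (e.g. a submission).
        return f"[executed {len(code)} characters of code]"


def evaluate(agent, user: SimulatedUser, sandbox: Sandbox, check_submission) -> bool:
    """Drive the multi-turn interaction, then score the final submission.

    `agent` is assumed to expose respond(message, history) -> code;
    `check_submission` decides whether the produced submission file exists,
    is well-formed, and matches or beats the human baseline — the kind of
    criterion behind the 40% success figures quoted above.
    """
    history: list[tuple[str, str]] = []
    message = user.next_message("")
    while message is not None:
        code = agent.respond(message, history)   # agent writes analysis code
        result = sandbox.run(code)               # code runs in the sandbox
        history.append((message, result))
        message = user.next_message(result)      # simulated user reacts
    return check_submission(history)
```

The separation of a simulated user from the agent is what distinguishes this setup from single-shot benchmarks: the agent cannot see all requirements up front and must adapt as new, sometimes conflicting, instructions arrive turn by turn.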