Agent Capability Evaluation
Sequoia China Publishes Two Papers in Ten Days
投资界· 2026-01-21 02:01
Core Insights
- Sequoia China and Unipat AI have released a major update to the xbench evaluation framework, introducing the BabyVision assessment of large models' pure visual understanding and signaling substantial future potential in world models and visual multimodality [2]
- The new AgentIF-OneDay evaluation system measures agents' ability to solve complex long-horizon tasks, moving beyond simple knowledge assessment to performance in real-world scenarios [2][3]

Evaluation Framework
- The AgentIF-OneDay framework explores the transition from one-hour to one-day capabilities, revealing the true performance of mainstream agents in workflow execution, implicit inference, and iterative editing [3]
- The evaluation aims to track the evolution of industry technology routes and predict the upper limits of model capabilities, with a focus on utility and economic value [7][8]

Agent Capabilities
- Agent capabilities are expected to evolve along two main axes, scaling context and scaling domain, which together determine the complexity of tasks agents can handle [8][9]
- Scaling context refers to extending tasks over time, requiring agents to maintain context and consistency across longer execution periods [8]
- Scaling domain expands the types of tasks agents can perform, moving beyond highly structured tasks to those spanning multiple domains and contexts [9]

Task Complexity
- The AgentIF-OneDay evaluation uses the complexity of tasks completable within one day as its benchmark, testing agents' ability to finish tasks without human intervention across diverse domains [12]
- Analysis of user work logs indicates that daily tasks fall into three types: workflow execution, example reference, and iterative editing [13]

Task Types
- Workflow execution requires agents to execute known processes accurately, while example reference requires agents to infer intent from provided examples [14][15]
- Iterative editing tasks require agents to maintain context and adapt to changing requirements across multiple interactions [16]

Evaluation Results
- Testing of existing mainstream agent systems shows that Manus, Genspark, and ChatGPT-Agent are currently the top performers, with overall task success rates between 0.62 and 0.65 [20]
- ChatGPT is identified as the best productivity tool, Manus as the best life assistant, and Genspark as the best study partner, highlighting different strengths across task domains [21][22]

Future Directions
- Development of the OneWeek evaluation set is underway, aiming to challenge agents with tasks that require a week of human work, a significant step toward agents taking on real job responsibilities [24]
- The transition to OneWeek tasks will require stricter evaluation criteria and the ability for agents to learn and adapt in real-world environments [25][26]
- Accumulating user data is crucial for improving agent reliability and performance on long-horizon tasks, mirroring the evolution of autonomous driving technology [27]
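To make the reported metrics concrete, here is a minimal sketch of how per-domain and overall task success rates of the kind quoted above (e.g. 0.62-0.65 overall, with different agents leading different domains) could be aggregated from pass/fail task records. The function name, record format, and all scores below are illustrative assumptions, not the benchmark's actual data or methodology.

```python
from collections import defaultdict

def aggregate_success(results):
    """Aggregate per-task pass/fail records into per-domain and overall rates.

    results: list of (agent, domain, passed) tuples.
    Returns {agent: {"overall": rate, "by_domain": {domain: rate}}}.
    """
    # agent -> domain -> [passed_count, total_count]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for agent, domain, passed in results:
        counts[agent][domain][0] += int(passed)
        counts[agent][domain][1] += 1

    summary = {}
    for agent, domains in counts.items():
        by_domain = {d: p / t for d, (p, t) in domains.items()}
        total_passed = sum(p for p, _ in domains.values())
        total_tasks = sum(t for _, t in domains.values())
        summary[agent] = {"overall": total_passed / total_tasks,
                          "by_domain": by_domain}
    return summary

# Made-up illustrative records only (not AgentIF-OneDay results):
records = [
    ("AgentA", "work", True), ("AgentA", "work", False),
    ("AgentA", "life", True), ("AgentA", "life", True),
    ("AgentB", "work", True), ("AgentB", "work", True),
    ("AgentB", "life", False), ("AgentB", "life", True),
]
summary = aggregate_success(records)
# "Best agent per domain", analogous to naming a best productivity tool
# versus a best life assistant:
best_work = max(summary, key=lambda a: summary[a]["by_domain"].get("work", 0.0))
```

A per-domain breakdown like this is what lets two agents with near-identical overall rates still rank differently as, say, a work tool versus a life assistant.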
AgentIF-OneDay Released: Evaluating Long-Horizon Complex Tasks Across All Scenarios
红杉汇· 2026-01-21 00:06
Core Insights
- The article discusses advancements in the Agent field, highlighting large models' impressive performance on short-term tasks while revealing their limitations on long-term tasks. It emphasizes the need for a more scientific evaluation framework to assess these models' multi-modal understanding and complex problem-solving capabilities [1][4].

Evaluation Framework
- The AgentIF-OneDay evaluation system aims to measure agents' ability to solve complex tasks rather than just their knowledge base. It explores the transition from OneHour to OneDay capabilities, revealing the true performance of mainstream agents in workflow execution, implicit inference, and iterative editing [1][6][10].
- The framework is designed to observe the evolution of industry technology routes and predict the upper limits of model capabilities, focusing on utility and economic value [6][10].

Task Complexity
- Task complexity is defined not by depth of knowledge or reasoning difficulty but by the human time investment a task requires, which correlates with its potential economic and utility value [6][7].
- Agent capabilities are expected to evolve along two main axes: scaling context (the time dimension of tasks) and scaling domain (task-type complexity). These axes determine the upper limits of task complexity that agents can handle [6][7].

Agent Capabilities
- The AgentIF-OneDay framework tests agents' ability to complete a full set of tasks within a day without human intervention, covering diverse domains such as life, learning, and work [10][11].
- Three primary task types are identified: Workflow Execution, Latent Instruction Inference, and Iterative Refinement, each representing a different user interaction scenario [11][14][15].
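The taxonomy above (three task types, multiple domains, complexity proxied by human time investment) can be sketched as a small data model. The type names come from the article; the field names, domain strings, and hour thresholds for the OneHour/OneDay/OneWeek tiers are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    # The three task types named in the article:
    WORKFLOW_EXECUTION = "workflow_execution"          # execute a known process accurately
    LATENT_INSTRUCTION_INFERENCE = "latent_inference"  # infer intent from provided examples
    ITERATIVE_REFINEMENT = "iterative_refinement"      # multi-turn edits under changing requirements

@dataclass
class Task:
    description: str
    task_type: TaskType
    domain: str          # e.g. "life", "learning", "work" (assumed labels)
    human_hours: float   # complexity proxy: human time investment, per the article

def complexity_tier(task: Task) -> str:
    """Bucket a task into a tier by the human-time proxy; thresholds are assumed."""
    if task.human_hours <= 1:
        return "OneHour"
    if task.human_hours <= 8:
        return "OneDay"
    return "OneWeek"
```

Under this framing, moving an evaluation from OneDay to OneWeek is simply raising the human-hours ceiling, which is why the article expects OneWeek tasks to demand stricter rubrics rather than a new taxonomy.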
Testing Results
- Evaluation of mainstream agent systems shows that Manus, Genspark, and ChatGPT-Agent scored between 0.62 and 0.65 in overall task success rate, indicating similar capabilities across systems [17][18].
- ChatGPT is identified as the best productivity tool for work, Manus as the best life assistant, and Genspark as the best study partner, showcasing the varying strengths of these agents across domains [18][19].

Future Directions
- The article anticipates that by 2026, agents will begin to take on one-week human workloads, with development of the OneWeek evaluation set already underway. This will involve more complex tasks and stricter rubric designs [22][23].
- Agents will need active learning capabilities in real or semi-real environments; future advancements will rely on continuous learning and adaptation rather than static training methods [24][25].