Sequoia China Publishes Two Papers in 10 Days

Core Insights
- Sequoia China and Unipat AI have launched a significant update to the xbench evaluation framework, introducing the BabyVision assessment to evaluate the pure visual understanding capabilities of large models, indicating substantial future potential in world models and visual multimodality [2]
- The new AgentIF-OneDay evaluation system measures the ability of agents to solve complex long-term tasks, moving beyond simple knowledge assessment to evaluate performance in real-world scenarios [2][3]

Evaluation Framework
- The AgentIF-OneDay framework explores the transition from one-hour to one-day capabilities, revealing the true performance of mainstream agents in workflow execution, implicit inference, and iterative editing [3]
- The evaluation aims to observe the evolution of industry technology routes and predict the upper limits of model capabilities, focusing on utility and economic value [7][8]

Agent Capabilities
- The evolution of agent capabilities is expected to follow two main lines: scaling context and scaling domain, which determine the complexity of tasks agents can handle [8][9]
- Scaling context refers to the extension of tasks over time, requiring agents to maintain context and consistency over longer execution periods [8]
- Scaling domain involves expanding the types of tasks agents can perform, moving beyond highly structured tasks to those that span multiple domains and contexts [9]

Task Complexity
- The AgentIF-OneDay evaluation uses the complexity of tasks that can be completed within a day as a benchmark, testing agents' abilities to complete tasks without human intervention across diverse domains [12]
- Analysis of user work logs indicates that daily tasks can be categorized into three types: workflow execution, example reference, and iterative editing [13]

Task Types
- Workflow execution involves agents executing known processes accurately, while example reference requires agents to infer intent from provided examples [14][15]
- Iterative editing tasks require agents to maintain context and adapt to changing requirements through multiple interactions [16]

Evaluation Results
- The AgentIF framework has tested existing mainstream agent systems, revealing that Manus, Genspark, and ChatGPT-Agent are currently the top performers, with overall task success rates between 0.62 and 0.65 [20]
- ChatGPT is identified as the best productivity tool, Manus as the best life assistant, and Genspark as the best study partner, highlighting different strengths across various task domains [21][22]

Future Directions
- The development of the OneWeek evaluation set is underway, aiming to challenge agents with tasks that require a week’s worth of human work, indicating a significant step towards agents taking on real job responsibilities [24]
- The transition to OneWeek tasks will necessitate more stringent evaluation criteria and the ability for agents to learn and adapt in real-world environments [25][26]
- The accumulation of user data is crucial for enhancing agent reliability and performance in long-term tasks, similar to the evolution of autonomous driving technology [27]
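
To make the reported figures concrete, the sketch below shows one way an overall task success rate, and a breakdown by the three task types (workflow execution, example reference, iterative editing), could be aggregated from per-task outcomes. It is an illustration only: the article does not describe how AgentIF-OneDay actually scores tasks, so the pass/fail model, the `TaskResult` record, the `success_rates` function, and the agent names and sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One evaluated task (hypothetical record layout, not from the benchmark)."""
    agent: str
    task_type: str  # e.g. "workflow_execution", "example_reference", "iterative_editing"
    passed: bool    # assumption: each task is scored as a simple pass/fail


def success_rates(results):
    """Return {agent: {task_type: rate, ..., "overall": rate}}."""
    # counts[agent][key] holds [number passed, number attempted]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in results:
        # Every task counts toward its own type and toward the overall figure.
        for key in (r.task_type, "overall"):
            counts[r.agent][key][0] += int(r.passed)
            counts[r.agent][key][1] += 1
    return {
        agent: {key: passed / total for key, (passed, total) in by_key.items()}
        for agent, by_key in counts.items()
    }


if __name__ == "__main__":
    # Made-up outcomes for a placeholder agent; these are not benchmark results.
    sample = [
        TaskResult("agent_a", "workflow_execution", True),
        TaskResult("agent_a", "workflow_execution", False),
        TaskResult("agent_a", "example_reference", True),
        TaskResult("agent_a", "iterative_editing", True),
    ]
    for agent, rates in success_rates(sample).items():
        print(agent, rates)
```

Keeping a per-type breakdown next to the overall rate reflects the article's observation that agents with similar overall scores can have different strengths across task domains.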