Sequoia China Launches Agent Benchmark "xbench": A Dual-Track Evaluation System Focused on AI's Real-World Utility

Core Insights
- Sequoia China has launched an internal AI and Agent benchmarking tool called "xbench" and published a corresponding paper titled "xbench: Tracking Agents Productivity, Scaling with Profession-Aligned Real-World Evaluations" [1][2]

Group 1: xbench Overview
- xbench employs a dual-track evaluation system to construct multidimensional assessment datasets, aiming to track both the theoretical capabilities of AI systems and the practical utility value of Agents in real-world applications [5][19]
- The initial release includes two core assessment sets: xbench-ScienceQA for scientific question answering and xbench-DeepSearch for deep search capabilities, along with comprehensive rankings of major products in these fields [5][25]

Group 2: Evaluation Methodology
- The xbench evaluation system is designed to address two core questions: the relationship between model capabilities and actual AI utility, and the comparability of capabilities across different time dimensions [10][11]
- The evaluation framework is dynamic, incorporating real-world application needs and continuously updating assessment content to ensure relevance and timeliness [5][17]

Group 3: AGI Tracking and Profession Aligned Evaluations
- xbench distinguishes between AGI Tracking evaluations, which verify whether models exhibit intelligent behavior in specific capability dimensions, and Profession Aligned evaluations, which focus on the delivery results and commercial value in real-world scenarios [19][20]
- The AGI Tracking assessments are foundational, while Profession Aligned evaluations represent advanced practices that align with actual business processes [19][20]

Group 4: Future Directions
- The company plans to expand the evaluation framework to include more professional fields such as finance, law, and sales, inviting industry experts to co-develop the assessment tasks [36][37]
- The long-term goal is to create a sustainable evaluation ecosystem that adapts to the rapid evolution of AI capabilities and market needs, ensuring that assessments remain relevant and effective [37][39]