Core Viewpoint
- Traditional AI benchmarks are rapidly losing effectiveness as many models achieve perfect scores, so evaluations no longer differentiate between models or offer useful guidance [1][2].

Group 1: Introduction of xbench
- Sequoia China launched a new AI benchmark called xbench, aiming to create a more scientific and long-lasting evaluation system that reflects the objective capabilities of AI [2][3].
- xbench is the first benchmark initiated by an investment institution. It was developed in collaboration with top universities and research institutions, and it uses a dual-track evaluation system and an evergreen evaluation mechanism [2][3].

Group 2: Features of xbench
- xbench employs a dual-track evaluation system to track both the theoretical capability limits of models and the practical value of AI systems in real-world applications [3][4].
- The evergreen evaluation mechanism ensures that the test content is continuously maintained and updated so that it stays relevant and timely [3][4].
- The initial release includes two core evaluation sets, xbench-ScienceQA and xbench-DeepSearch, along with a ranking of major products in these fields [4][10].

Group 3: Addressing Core Issues
- Sequoia China identified two core issues: the relationship between model capabilities and actual AI utility, and the loss of comparability in AI capabilities over time due to frequent updates in the question bank [5][6].
- xbench aims to break away from conventional thinking by developing novel task settings and evaluation methods that align with real-world applications [6][7].

Group 4: Dynamic Evaluation Mechanism
- xbench plans to establish a dynamic evaluation mechanism that collects live data from real business scenarios and invites industry experts to help build and maintain the evaluation sets [9][8].
- The design includes horizontally comparable capability metrics for observing development speed and key breakthroughs over time, which helps determine when an agent can take over existing business processes [9][8].

Group 5: Community Engagement
- xbench encourages community participation, allowing developers and researchers to use the latest evaluation sets to validate their products and to contribute to the development of industry-specific standards [11][10].
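Express | Sequoia China enters the AI evaluation arena: why does xbench want to "move beyond brain teasers" and test the real-world utility of AI?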
Z Potentials·2025-05-27 02:37