xbench Evaluation Sets Officially Open-Sourced

Core Insights
- The article introduces xbench, an open-source AI benchmarking tool aimed at quantifying the effectiveness of AI systems in real-world scenarios and utilizing a long-term evaluation mechanism [1]
- The launch of xbench has generated significant interest from both large enterprises and startups, with increasing demand for product testing using the xbench evaluation sets [1]
- The initiative aims to foster collaboration within the AI community by providing transparent and open-source resources [1]

Group 1: xbench Evaluation Sets
- The xbench-ScienceQA evaluation set focuses on high-quality, multi-disciplinary questions sourced from top academic institutions and industry experts, addressing the limitations of existing benchmarks [2]
- The average accuracy of the xbench-ScienceQA set is 32%, with one-third of the questions having an accuracy below 20%, indicating a high level of difficulty and differentiation among models [12][10]
- The xbench-DeepSearch evaluation set is designed to assess the deep search capabilities of AI agents, emphasizing the need for comprehensive planning, searching, reasoning, and summarization skills [3]

Group 2: Evaluation Methodology
- The xbench-ScienceQA set includes 77 Q&A questions, 14 multiple-choice questions, and 9 single-choice questions, with a focus on reducing the impact of single-choice questions on scoring [8]
- The question construction process for both evaluation sets involves rigorous validation to ensure the uniqueness and correctness of answers, with a focus on avoiding easily searchable content [6][13]
- Both evaluation sets will be continuously updated, with monthly performance reports and quarterly updates to maintain relevance and accuracy [2][3]

Group 3: Community Engagement
- The article encourages AI enthusiasts, model developers, and researchers to participate in the ongoing development and testing of AI technologies through xbench [31]
- Contact information is provided for those interested in contributing to the evaluation sets or seeking feedback on their models [32]
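
The difficulty figures quoted for xbench-ScienceQA (an average accuracy and a share of questions falling below 20%) can be illustrated with a small sketch. This is not xbench's own scoring code: the function names, the data layout, and the toy results below are all assumptions made for illustration, and the only value taken from the summary is the 20% threshold.

```python
from statistics import mean


def question_accuracy(results):
    """Per-question accuracy: fraction of model runs that answered correctly.

    `results` is a hypothetical layout: question id -> list of booleans,
    one entry per evaluated model.
    """
    return {qid: sum(outcomes) / len(outcomes) for qid, outcomes in results.items()}


def difficulty_summary(results, hard_threshold=0.20):
    """Return (average accuracy, share of questions below hard_threshold)."""
    acc = question_accuracy(results)
    average = mean(acc.values())
    hard_share = sum(1 for a in acc.values() if a < hard_threshold) / len(acc)
    return average, hard_share


# Toy data, invented for illustration only (not real xbench results).
toy_results = {
    "q1": [True, False, False, False, False],
    "q2": [False, False, False, False, False],
    "q3": [True, True, True, False, False],
}

avg, hard_share = difficulty_summary(toy_results)
print(f"average accuracy: {avg:.1%}, share below 20%: {hard_share:.1%}")
```

A low average combined with a large share of very hard questions is what lets a benchmark separate strong models from weak ones, which is the point the summary makes about differentiation.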