Core Insights
- The xbench-DeepSearch evaluation set has been upgraded with a new set of 100 questions, on which ChatGPT-5 Pro leads all other agents by a clear margin [1][2][3]
- The DeepSearch-2510 question bank has been open-sourced, allowing broader access and evaluation [1][2]

Evaluation Results
- ChatGPT-5 Pro achieved an accuracy score of 75+, at a cost of approximately $0.085 per task and a time cost of 5-8 minutes [3]
- SuperGrok Expert ranked second with an accuracy score of 40+, at a cost of around $0.08 per task and a time cost of 3-5 minutes [3]
- Other agents, such as Minimax and StepFun, scored around 35+, with varying costs and time requirements [3][19]

User Experience Insights
- The evaluation treats accuracy, response time, and cost as the key factors in user experience, with acceptable thresholds of under $0.25 per task and a response time within 8 minutes [6][4]
- Several agents, including ChatGPT-5 Pro and SuperGrok Expert, fall within the optimal user experience range [6]

Updates and Improvements
- The new DeepSearch-2510 version is more difficult and includes more multimodal questions, which require agents to interpret images or videos [9]
- The update also adds questions that require dynamic interaction with web sources, reflecting advances in agent capabilities [9]

Performance Analysis
- ChatGPT-5 Pro's leading performance is attributed to its reduced hallucination rate and enhanced tool use, which allow better source verification and more accurate responses [12][13]
- SuperGrok's strong performance is linked to the Grok-4 model, which strengthens its reasoning capabilities [14]

Competitive Landscape
- Chinese domestic agents generally score between 30 and 40, with no significant differentiation among them, which is attributed to foundation-model capabilities [19]
- The performance of various agents has improved significantly over recent months, with notable gains for ChatGPT and SuperGrok due to model updates [16][17]
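
The user-experience thresholds cited above (under $0.25 per task, response within 8 minutes) can be read as a simple filter over the reported results. The following is a minimal sketch, not the xbench scoring code: the `AgentResult` type, its field names, and the `within_ux_range` helper are hypothetical, and each "N+" score and time range from the summary is reduced to a single number (lower bound for accuracy, upper bound for time) purely for illustration.

```python
from dataclasses import dataclass

# Thresholds as stated in the summary; the constant names are assumptions.
MAX_COST_USD = 0.25   # cost per task must be under this value
MAX_MINUTES = 8.0     # response time must be within this value


@dataclass
class AgentResult:
    """Hypothetical container for one agent's reported figures."""
    name: str
    accuracy: float      # lower bound of the reported score, e.g. "75+" -> 75
    cost_usd: float      # approximate cost per task in USD
    max_minutes: float   # upper end of the reported time range


def within_ux_range(result: AgentResult) -> bool:
    """Return True if the agent meets both the cost and the time threshold."""
    return result.cost_usd < MAX_COST_USD and result.max_minutes <= MAX_MINUTES


# Figures taken from the summary for the two agents it quantifies.
results = [
    AgentResult("ChatGPT-5 Pro", accuracy=75, cost_usd=0.085, max_minutes=8),
    AgentResult("SuperGrok Expert", accuracy=40, cost_usd=0.08, max_minutes=5),
]

acceptable = [r.name for r in results if within_ux_range(r)]
print(acceptable)
```

Under these assumptions both agents pass the filter, which is consistent with the summary's statement that they fall within the optimal user experience range.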
DeepSearch Question Bank and Leaderboard Updated, Latest Question Bank Now Open-Sourced | xbench Monthly Report
红杉汇·2025-10-27 00:04