Sequoia China Launches Agent Benchmark "xbench": A Dual-Track Evaluation System Focused on AI's Real-World Utility
Founder Park·2025-05-26 06:44

This article is reposted from Sequoia China's official WeChat account "红杉汇", with minor edits.

Sequoia China has opened up "xbench", the tool it uses internally to benchmark AI and agents, and has published the accompanying paper, "xbench: Tracking Agents Productivity, Scaling with Profession-Aligned Real-World Evaluations".

Paper: https://xbench.org/files/xbench_profession_v2.4.pdf

TLDR:

| Benchmark | Category | 1st | 2nd | 3rd |
| --- | --- | --- | --- | --- |
| xbench-ScienceQA | AGI Tracking | o3-high 60.8 | Gemini 2.5 Pro 57.2 | Doubao-1.5-thinking-pro 53.6 |
| xbench-DeepSearch | AGI Tracking | o3 65+ | o4-mini-high 60+ | ... |