Core Insights
- The latest xbench-ScienceQA leaderboard has been released, covering new models from six companies. Gemini 3 Pro achieves state-of-the-art (SOTA) performance, and DeepSeek V3.2 matches GPT-5.1's score while offering high cost-effectiveness [1][2][6]
- xbench will introduce two new benchmarks, one evaluating agent instruction-following and one evaluating multimodal understanding [1]

Model Performance Summary
- Gemini 3 Pro: Scored 71.6, up from 59.4 for Gemini 2.5 Pro, with a best-of-N (BoN) score of 85. Average response time is 48.62 seconds, and answering 500 questions costs approximately $3 [3][6]
- DeepSeek V3.2: Scored 62.6, matching GPT-5.1, with a BoN of 81. Answering 500 questions costs only $2 for the Speciale version and $1.3 for the Thinking version [6]
- Claude Opus 4.5: Scored 55.2 with a fast average response time of 13 seconds, an improvement over its predecessor [6]
- Kimi K2 Thinking: Scored 51.8 with a BoN of 76, a slight improvement [6]

New Model Developments
- DeepSeek V3.2: Introduces a Sparse Attention mechanism that improves long-context performance while reducing computational complexity. It also features a scalable reinforcement learning framework to improve reasoning and instruction-following capabilities [10][12]
- Gemini 3: A new multimodal model from Google DeepMind that excels in reasoning depth and multimodal understanding, reaching a top score of 1501 Elo on LMArena [13]
- Nano Banana Pro: A new image generation model that combines advanced reasoning capabilities with real-time knowledge, enabling complex image synthesis [14]
- Claude Opus 4.5: A flagship model from Anthropic that excels in code generation and human-computer interaction, with high performance on real-world software engineering tasks [15][16]
- GPT-5.1: An important iteration from OpenAI that improves conversational fluency and complex task reasoning and introduces adaptive reasoning mechanisms [17]
- Tongyi DeepResearch: Designed for deep research tasks, this model combines mid-training and post-training frameworks to strengthen agent capabilities, achieving competitive performance with a smaller model [19]
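
To make the cost figures in the Model Performance Summary easier to compare, the reported totals for 500 questions can be converted to an average cost per question. The sketch below is only illustrative: the dollar amounts are the approximate totals quoted above, while the names `QUESTIONS`, `total_cost_usd`, and `cost_per_question` are assumptions introduced for this example and do not come from xbench.

```python
# Approximate total cost (USD) for answering 500 questions, as quoted in the
# summary above. Variable and function names are hypothetical, for illustration.
QUESTIONS = 500
total_cost_usd = {
    "Gemini 3 Pro": 3.0,
    "DeepSeek V3.2 Speciale": 2.0,
    "DeepSeek V3.2 Thinking": 1.3,
}


def cost_per_question(total_usd: float, n_questions: int = QUESTIONS) -> float:
    """Return the average cost of one question in USD."""
    return total_usd / n_questions


per_question = {
    name: cost_per_question(total) for name, total in total_cost_usd.items()
}

# List the models from cheapest to most expensive per question.
for name, usd in sorted(per_question.items(), key=lambda item: item[1]):
    print(f"{name}: ${usd:.4f} per question")
```

Under these figures, the totals work out to roughly $0.006 per question for Gemini 3 Pro, $0.004 for the DeepSeek V3.2 Speciale version, and $0.0026 for the Thinking version.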
xbench Leaderboard Update! DeepSeek V3.2 Ties GPT-5.1 | xbench Monthly Report
红杉汇·2025-12-05 00:06