SWEBench

Claude Just Got a Big Update (Opus 4.1)
Matthew Berman · 2025-08-05 23:02
Model Release & Performance
- Anthropic released Claude Opus 4.1, an upgrade to Claude Opus 4 with gains chiefly in agentic tasks, real-world coding, and reasoning [1]
- On SWE-bench Verified, Claude Opus 4.1 scores 74.5%, up 2 percentage points from Opus 4's 72.5% [3]
- On Terminal-Bench (terminal use), the score rises from 39.2% to 43.3%, a gain of 4.1 percentage points [4]
- On GPQA Diamond (graduate-level reasoning), the score rises from 79.6% to 80.9%, a gain of 1.3 percentage points [4]
- On TAU-bench (agentic tool use), the retail score rises from 81.4% to 82.4%, a gain of 1 percentage point, while the airline score falls from 59.6% to 56%, a drop of 3.6 percentage points [5]
- On the multilingual Q&A benchmark, the score rises from 88.8% to 89.5%, a gain of 0.7 percentage points [5]
- On AIME 2025, the score rises by 2.5 percentage points to 78% [5]

Competitive Positioning & Future Outlook
- On SWE-bench and Terminal-Bench, Claude Opus 4.1 outperforms OpenAI's o3 and Gemini 2.5 Pro [5]
- On GPQA Diamond and agentic tool use, Claude Opus 4.1 trails OpenAI's o3 and Gemini 2.5 Pro [6]
- On the high-school math competition benchmark, Claude Opus 4.1 scores only 78%, below OpenAI's o3 (88.9%) and Gemini 2.5 Pro (88%) [6]
- Claude is currently widely regarded as the best coding model on the market, especially for agentic coding and agent-driven development [7]
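The percentage-point changes quoted above can be re-derived from the before/after scores. The following is a minimal sketch for checking that arithmetic; the dictionary layout and the `point_deltas` helper are illustrative assumptions, and the only data used are the figures stated in the summary.

```python
# Scores in percent, copied from the summary: (Opus 4, Opus 4.1).
# The dictionary keys are labels chosen for this sketch.
SCORES = {
    "SWE-bench Verified": (72.5, 74.5),
    "Terminal-Bench": (39.2, 43.3),
    "GPQA Diamond": (79.6, 80.9),
    "Agentic tool use (retail)": (81.4, 82.4),
    "Agentic tool use (airline)": (59.6, 56.0),
    "Multilingual Q&A": (88.8, 89.5),
}


def point_deltas(scores):
    """Return {benchmark: change in percentage points}, rounded to 1 decimal."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


if __name__ == "__main__":
    for name, delta in point_deltas(SCORES).items():
        # The explicit sign makes the one regression (airline) easy to spot.
        print(f"{name}: {delta:+.1f} pp")
```

These are absolute differences in percentage points, not relative improvements; a 2-point gain from 72.5% is roughly a 2.8% relative increase.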
China Went HARD...
Matthew Berman · 2025-07-24 00:30
Model Performance & Capabilities
- Qwen3-Coder rivals Anthropic's Claude family in coding performance, scoring 69.6% on SWE-bench Verified against Claude Sonnet 4's 70.4% [1]
- The most powerful variant, Qwen3-Coder 480B, is a mixture-of-experts model with 480 billion total parameters, of which 35 billion are active [2][3]
- The model supports a native context length of 256k tokens, extendable to 1 million tokens with extrapolation methods, which helps with tool calling and agentic use [4]

Training Data & Methodology
- The model was pre-trained on 7.5 trillion tokens with a 70% code ratio, improving coding ability while maintaining general and math skills [5]
- Qwen2.5-Coder was used to clean and rewrite noisy data, significantly improving overall data quality [6]
- Code RL training was scaled across a broader, more diverse set of real-world coding tasks to unlock the full potential of reinforcement learning [7][8]

Tooling & Infrastructure
- Qwen launched Qwen Code, a command-line tool adapted from Gemini Code that enables agentic, multi-turn execution with planning [2][5][9]
- A scalable system was built to run 20,000 independent environments in parallel, leveraging Alibaba Cloud's infrastructure for self-play [10]

Open Source & Accessibility
- The model is hosted on Hugging Face, where it is free to use and try out [11]
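As a back-of-the-envelope reading of the figures above, the sketch below works out what share of the parameters is active and how many pre-training tokens are code. The constant and function names are assumptions made for illustration; the calculation says nothing about how expert routing is actually implemented.

```python
# Figures as stated in the summary; names are chosen for this sketch.
TOTAL_PARAMS = 480e9       # total parameters
ACTIVE_PARAMS = 35e9       # active parameters in the mixture-of-experts model
PRETRAIN_TOKENS = 7.5e12   # pre-training tokens
CODE_RATIO = 0.70          # share of pre-training tokens that are code


def active_fraction(active, total):
    """Fraction of the model's parameters that are active."""
    return active / total


def code_tokens(total_tokens, ratio):
    """Number of pre-training tokens that are code."""
    return total_tokens * ratio


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    code = code_tokens(PRETRAIN_TOKENS, CODE_RATIO)
    print(f"Active share of parameters: {frac:.1%}")
    print(f"Code tokens in pre-training: {code / 1e12:.2f} trillion")
```

Under these numbers about 7% of the parameters are active, and roughly 5.25 trillion of the 7.5 trillion pre-training tokens are code.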