DeepSeek Model Upgrade

Core Viewpoint
- DeepSeek has released an updated model, DeepSeek-V3.1-Terminus, which addresses user feedback on the previous version while retaining that model's capabilities [1][2].

Group 1: Model Improvements
- Output stability is better than in the previous version: the update reduces inconsistent language use (mixed-language output) and occasional abnormal characters [2].
- The Code Agent and Search Agent have been further optimized [2].

Group 2: Benchmark Results
- Of the twelve reported benchmarks, nine improve and three regress relative to the previous version [3]:
  - MMLU-Pro: increased from 84.8 to 85.0
  - GPQA-Diamond: increased from 80.1 to 80.7
  - Humanity's Last Exam: increased from 15.9 to 21.7
  - LiveCodeBench: increased from 74.8 to 74.9
  - Codeforces: decreased from 2091 to 2046
  - Aider-Polyglot: decreased from 76.3 to 76.1
  - BrowseComp: increased from 30.0 to 38.5
  - BrowseComp-zh: decreased from 49.2 to 45.0
  - SimpleQA: increased from 93.4 to 96.8
  - SWE Verified: increased from 66.0 to 68.4
  - SWE-bench Multilingual: increased from 54.5 to 57.8
  - Terminal-bench: increased from 31.3 to 36.7

Group 3: Availability
- The DeepSeek-V3.1-Terminus update has been rolled out to the official app, the web version, the mini-program, and the DeepSeek API [4].
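
The benchmark figures reported under Group 2 are easier to compare as per-benchmark changes. The following is a minimal sketch that computes them from the numbers listed there. The names `SCORES`, `deltas`, and `split_by_direction` are illustrative and not part of any DeepSeek tooling. Note that Codeforces is a rating, so its change is not on the same scale as the other scores.

```python
# Scores as listed under Group 2: (previous version, DeepSeek-V3.1-Terminus).
# Codeforces is a rating; the other entries are benchmark scores.
SCORES = {
    "MMLU-Pro": (84.8, 85.0),
    "GPQA-Diamond": (80.1, 80.7),
    "Humanity's Last Exam": (15.9, 21.7),
    "LiveCodeBench": (74.8, 74.9),
    "Codeforces": (2091, 2046),
    "Aider-Polyglot": (76.3, 76.1),
    "BrowseComp": (30.0, 38.5),
    "BrowseComp-zh": (49.2, 45.0),
    "SimpleQA": (93.4, 96.8),
    "SWE Verified": (66.0, 68.4),
    "SWE-bench Multilingual": (54.5, 57.8),
    "Terminal-bench": (31.3, 36.7),
}


def deltas(scores):
    """Return {benchmark: new - old}, rounded to one decimal place."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


def split_by_direction(scores):
    """Return (improved, regressed) benchmark names, each sorted alphabetically."""
    changes = deltas(scores)
    improved = sorted(name for name, d in changes.items() if d > 0)
    regressed = sorted(name for name, d in changes.items() if d < 0)
    return improved, regressed


if __name__ == "__main__":
    # Print the changes from largest gain to largest drop.
    ranked = sorted(deltas(SCORES).items(), key=lambda kv: kv[1], reverse=True)
    for name, change in ranked:
        print(f"{name:<24}{change:+.1f}")
```

Rounding to one decimal place matches the precision of the reported scores and avoids floating-point noise in the differences.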