How Does DeepSeek's New R1 Model Actually Perform? A Third-Party Evaluation Is In

Core Insights
- DeepSeek has released an upgraded version of its R1 model. It outperforms its predecessor and surpasses OpenAI's o3 model, but still lags behind o4-mini(high) and Google's Gemini 2.5 Pro Preview 05-06 [1][2]

Model Performance
- The new R1 model achieved a total score of 63.55, an increase of 1.61 points over the previous version, placing it fourth in the rankings [2]
- The highest score went to o4-mini(high) at 70.51, followed by Gemini 2.5 Pro Preview 05-06 at 66.48 [2]

Reasoning and Instruction Following
- The new R1 model's instruction-following capability improved significantly: it scored 48.46, which is 17.09 points higher than the old version, but still falls short of top international models such as o3 (66.95) and o4-mini(high) (68.07) [4]
- The reasoning task score declined by 1.7 points compared with the old R1 model. The drop came mainly from mathematical and scientific reasoning tasks, while the new model performed better on coding tasks [4]

Reduction in Hallucination Rate
- The updated R1 model has been optimized for "hallucination" issues, with hallucination rates reduced by approximately 45% to 50% in tasks such as rewriting, summarization, and reading comprehension [4]
- The hallucination rate of the new R1 model is now 13.86%, a decrease of 7.16 percentage points, although a significant gap remains to the best-performing model, doubao-1.5-pro-32k, whose hallucination rate is only 4.11% [5]
- The most notable improvements in hallucination rate were in text summarization and reading comprehension tasks, with reductions of 9.27% and 14.49%, respectively [5]
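
The article gives current scores and their changes but not the old R1 figures themselves. As a rough sketch, the snippet below backs out the previous values from the quoted numbers and computes the gaps to the two leading models. The helper name `previous_value` and all derived results are illustrative and follow only from the arithmetic; they are not figures published by the evaluation.

```python
def previous_value(current: float, change: float) -> float:
    """Back out the earlier value from the current value and the reported change."""
    return round(current - change, 2)


# Figures quoted in the article (a positive change means the new value is higher).
new_total = 63.55
old_total = previous_value(new_total, 1.61)            # total score rose by 1.61
old_instruction = previous_value(48.46, 17.09)         # instruction following rose by 17.09
old_hallucination = previous_value(13.86, -7.16)       # hallucination rate fell by 7.16 points

# Relative drop in the overall hallucination rate, derived here (an assumption
# of this sketch, not a number stated in the article).
relative_drop_pct = round(7.16 / old_hallucination * 100, 1)

# Gap in total score between the new R1 and the two models ranked above it.
gap_to_o4_mini_high = round(70.51 - new_total, 2)
gap_to_gemini = round(66.48 - new_total, 2)

print(f"Old total score:          {old_total}")
print(f"Old instruction score:    {old_instruction}")
print(f"Old hallucination rate:   {old_hallucination}%")
print(f"Relative overall drop:    {relative_drop_pct}%")
print(f"Gap to o4-mini(high):     {gap_to_o4_mini_high}")
print(f"Gap to Gemini 2.5 Pro:    {gap_to_gemini}")
```

Under these assumptions, the implied overall relative drop in hallucination rate is about 34%, lower than the 45% to 50% range, which the article attributes to specific tasks such as rewriting, summarization, and reading comprehension rather than to the overall rate.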