Chinese Large Models
Chinese Large Model Benchmark Evaluation: 2025 Annual Report - SuperCLUE
Sou Hu Cai Jing· 2026-02-05 07:35
Core Insights
- The Chinese large model sector saw accelerated development in 2025. The SuperCLUE annual evaluation covers 23 representative models from domestic and international vendors, focusing on general capabilities, specialized tasks, and application scenarios [1][2].

Group 1: Model Performance
- The top-ranking closed-source model is Anthropic's Claude-Opus-4.5-Reasoning at 68.25, followed by Google Gemini-3-Pro-Preview and OpenAI GPT-5.2 (high) [1][23].
- Domestic models are moving from "catching up" to "running alongside": Kimi-K2.5-Thinking (61.50) and Qwen3-Max-Thinking (60.61) rank fourth and sixth globally, excelling in code generation and mathematical reasoning tasks [1][2][23].
- Gaps remain in precise instruction adherence and hallucination control, with average score differences exceeding 7 points and nearly 2 points, respectively [2].

Group 2: Technological Evolution
- The technology has evolved through three stages: an early phase of competition among many models and the emergence of multimodal capabilities, a mid-stage explosion of multimodal applications and reasoning breakthroughs, and the rise of intelligent agents and ecosystem reconstruction by 2025 [1][2].
- The mixture-of-experts (MoE) architecture has become mainstream (a minimal illustrative routing sketch follows this summary), with domestic open-source models, led by DeepSeek and Qwen3, capturing a significant share of the global market [1][2].

Group 3: Application and Cost-Effectiveness
- In application scenarios, general intelligent agents are still at a foundational stage and remain limited in handling complex tasks; domestic models, however, excel in multimodal areas such as image-to-video generation and Chinese-language adaptation [2].
- Domestic models offer significant cost-effectiveness: Kimi-K2.5-Thinking is priced at roughly one-third of comparable overseas models, although overseas models lead in reasoning efficiency [2].

Group 4: Future Directions
- The Chinese large model sector has advanced markedly in technological innovation, application deployment, and ecosystem construction, establishing core competitive advantages in open-source ecosystems, vertical applications, and cost-effectiveness [2].
- Future efforts should focus on closing the gaps in precise instruction adherence and hallucination control to make the technology more efficient and reliable [2].
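The MoE trend noted in Group 2 refers to models that route each token through only a small subset of specialized expert sub-networks rather than the full parameter set. The report does not include implementation details; the snippet below is a minimal, generic sketch of top-k expert routing, where all dimensions, names, and the gating scheme are illustrative assumptions, not taken from DeepSeek, Qwen3, or the SuperCLUE report.

```python
# Minimal, generic sketch of top-k mixture-of-experts (MoE) routing.
# All dimensions and the gating scheme are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 16      # hidden size (tiny, for illustration only)
N_EXPERTS = 4     # number of expert feed-forward blocks
TOP_K = 2         # each token is routed to its 2 highest-scoring experts

# Each "expert" is a small feed-forward weight matrix.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.1 for _ in range(N_EXPERTS)]
# The router projects a token representation to one score per expert.
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token (row of x) to TOP_K experts and mix their outputs."""
    logits = x @ router_w                          # (tokens, N_EXPERTS)
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        top = np.argsort(logits[i])[-TOP_K:]       # indices of the chosen experts
        weights = np.exp(logits[i][top])
        weights /= weights.sum()                   # softmax over the chosen experts only
        for w, e in zip(weights, top):
            out[i] += w * np.maximum(tok @ experts[e], 0.0)  # weighted expert output (ReLU FFN)
    return out

tokens = rng.standard_normal((3, D_MODEL))         # 3 dummy token vectors
print(moe_layer(tokens).shape)                     # (3, 16): same shape, but sparse compute
```

Because only TOP_K of N_EXPERTS experts run per token, total parameter count can grow while per-token inference cost stays roughly constant, which is the practical appeal behind the trend the report describes.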
How Does the New DeepSeek R1 Model Actually Perform? A Third-Party Evaluation Weighs In
Nan Fang Du Shi Bao· 2025-06-05 12:26
Core Insights
- DeepSeek has released an upgraded version of its R1 model. It outperforms its predecessor and surpasses OpenAI's o3 model, although it still lags behind o4-mini(high) and Google's Gemini 2.5 Pro Preview 05-06 [1][2].

Model Performance
- The new R1 model achieved a total score of 63.55, up 1.61 points from the previous version, placing it fourth in the rankings [2].
- The highest score went to o4-mini(high) at 70.51, followed by Gemini 2.5 Pro Preview 05-06 at 66.48 [2].

Reasoning and Instruction Following
- Instruction-following improved significantly: the new R1 scored 48.46, 17.09 points higher than the old version, but still short of international top models such as o3 (66.95) and o4-mini(high) (68.07) [4].
- Reasoning task scores declined by 1.7 points compared with the old R1, mainly on mathematical and scientific reasoning tasks, while the new model performs better on coding tasks [4].

Reduction in Hallucination Rate
- The updated R1 model has reduced its "hallucination" rate by approximately 45%-50% in tasks such as rewriting, summarization, and reading comprehension [4].
- The overall hallucination rate of the new R1 is now 13.86%, a decrease of 7.16 percentage points, though a significant gap remains versus the best-performing model, doubao-1.5-pro-32k, at only 4.11% [5].
- The most notable improvements were in text summarization and reading comprehension, with reductions of 9.27% and 14.49%, respectively (see the back-of-the-envelope calculation after this summary) [5].
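As a quick sanity check on the hallucination figures above, the snippet below back-computes the old model's overall rate from the reported 13.86% and the 7.16-percentage-point drop, and contrasts the resulting relative reduction with the 45%-50% per-task figures. The assumption that the 45%-50% numbers are relative reductions while 7.16 is an absolute point drop is mine, not stated in the article.

```python
# Illustrative arithmetic only, using figures quoted in the evaluation summary above.
# Assumption (mine, not stated in the article): the 45%-50% per-task figures are
# *relative* reductions, while 7.16 is an *absolute* percentage-point drop.

new_overall = 13.86              # new R1 overall hallucination rate, in %
drop_points = 7.16               # reported decrease, in percentage points
best_cited = 4.11                # doubao-1.5-pro-32k overall rate, in %

old_overall = new_overall + drop_points          # ≈ 21.02 %
relative_reduction = drop_points / old_overall   # ≈ 0.34

print(f"old overall rate ≈ {old_overall:.2f} %")
print(f"overall relative reduction ≈ {relative_reduction:.0%}")     # ≈ 34 %
print(f"gap vs doubao-1.5-pro-32k ≈ {new_overall / best_cited:.1f}x")  # ≈ 3.4x
```

Read this way, the overall rate fell by roughly a third, while the 45%-50% figures describe larger relative gains on specific tasks; even so, the new R1's overall rate remains about 3.4 times that of the best-performing model cited.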