Chinese Large Model Benchmark Evaluation 2025 Annual Report - SuperCLUE

Core Insights
- Development of the Chinese large model sector accelerated in 2025. The SuperCLUE annual evaluation covers 23 representative domestic and international models, focusing on general capabilities, specialized tasks, and application scenarios [1][2].

Group 1: Model Performance
- The top-ranking closed-source model is Anthropic's Claude-Opus-4.5-Reasoning, scoring 68.25, followed by Google Gemini-3-Pro-Preview and OpenAI GPT-5.2 (high) [1][23].
- Domestic models are moving from "catching up" to "running alongside": Kimi-K2.5-Thinking (61.50) and Qwen3-Max-Thinking (60.61) rank fourth and sixth globally and excel in code generation and mathematical reasoning tasks [1][2][23].
- The gap between domestic and overseas models remains significant in precise instruction following and hallucination control, with average score differences of more than 7 points and nearly 2 points, respectively [2].

Group 2: Technological Evolution
- The technology has evolved through three stages: an early stage of competition among numerous models and the emergence of multimodal capabilities; a middle stage marked by an explosion of multimodal applications and breakthroughs in reasoning; and, in 2025, the rise of intelligent agents and a restructuring of the ecosystem [1][2].
- The mixture-of-experts (MoE) architecture has become mainstream, and domestic open-source models, led by DeepSeek and Qwen3, hold a significant share of the global market [1][2].

Group 3: Application and Cost-Effectiveness
- In application scenarios, general-purpose intelligent agents are still at a foundational stage and struggle with complex tasks. Domestic models nevertheless excel in multimodal areas such as image-to-video generation and Chinese-language adaptation [2].
- Domestic models are notably cost-effective: Kimi-K2.5-Thinking is priced at only one-third of comparable overseas models, although overseas models are ahead in reasoning efficiency [2].

Group 4: Future Directions
- The Chinese large model sector has made significant advances in technological innovation, application deployment, and ecosystem construction, establishing core competitive advantages in open-source ecosystems, vertical applications, and cost-effectiveness [2].
- Future efforts should focus on closing the gaps in precise instruction following and hallucination control, making the technology more efficient and reliable [2].
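
To make the "running alongside" claim under Group 1 concrete, the point gaps to the top-ranked model can be worked out from the three scores quoted in this summary. The snippet below is only an illustrative sketch: the function name `gaps_to_leader` and the dictionary layout are assumptions made for this example, and models whose scores are not quoted here are left out.

```python
# Scores exactly as quoted in the summary above; models without a quoted
# score (e.g. Gemini-3-Pro-Preview, GPT-5.2) are intentionally omitted.
reported_scores = {
    "Claude-Opus-4.5-Reasoning": 68.25,
    "Kimi-K2.5-Thinking": 61.50,
    "Qwen3-Max-Thinking": 60.61,
}


def gaps_to_leader(scores):
    """Return each model's point gap to the highest score, rounded to 2 decimals.

    Hypothetical helper written for this illustration only.
    """
    top = max(scores.values())
    return {name: round(top - value, 2) for name, value in scores.items()}


gaps = gaps_to_leader(reported_scores)
for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points behind the leader")
```

On these figures, the two domestic models sit roughly 7 to 8 points behind the leader on the overall score.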