SuperCLUE's Latest Evaluation: Wenxin X1.1 Ranks First in China for Precise Instruction Following
BIDU (US:BIDU) · Cai Jing Wang · 2025-10-21 07:42

Core Insights
- The SuperCLUE-CPIF evaluation benchmark was officially released on October 21, with the Wenxin X1.1 model scoring 75.51, ranking first among domestic large models [1]
- The evaluation included 10 domestic and international models, focusing on the ability of large language models (LLMs) to accurately follow complex and multi-constraint instructions in a Chinese context [1]
- The Wenxin X1.1 model demonstrated significant advantages in practical production environments, outperforming competitors such as DeepSeek-V3.2-Exp-Thinking and Hunyuan-T1-20250822, which scored 73.98 and 65.82 respectively [1]

Group 1
- The Wenxin X1.1 model is based on the Wenxin large model 4.5 and utilizes an iterative mixed reinforcement learning training framework to enhance performance on both general and agent tasks [1]
- The model employs self-distillation data for iterative production and training, continuously improving overall effectiveness [1]

Group 2
- Wenxin X1.1 excels in handling complex writing tasks by leveraging internal knowledge and online search tools to accurately meet user needs while maintaining creativity and logical structure [2]
- In scenarios involving complex long-term tasks, the model effectively manages various user emotions and issues by adhering to business processes and autonomously utilizing tools [2]
- Compared to the previous Wenxin model, the Wenxin X1.1 has shown a 34.8% improvement in factual accuracy, a 12.5% increase in instruction adherence, and a 9.6% enhancement in agent performance [2]