Comprehensive evaluation results for 100+ large models are out! Zhiyuan releases FlagEval "hundred-model" evaluation results, covering text, speech, image, and video modalities
量子位·2024-12-19 09:45

Core Insights
- The Zhiyuan Research Institute released its latest comprehensive evaluation results for over 100 language and multimodal models, signaling a shift in the second half of 2024 toward strengthening comprehensive capabilities and practical applications [2][4].

Model Performance
- In the subjective evaluation of language models, ByteDance's Doubao-pro-32k-preview and Baidu's ERNIE 4.0 Turbo ranked first and second, respectively, demonstrating strong Chinese-language capabilities [3][6].
- The evaluation found that while language-model performance in general Chinese scenarios has stabilized, a significant gap remains on complex tasks relative to top international models [6][9].

Multimodal Model Evaluation
- The visual language model evaluation showed that domestic models are narrowing the performance gap with leading closed-source models, though room for improvement remains in long-tail visual knowledge and complex data analysis [10][11].
- In the text-to-image generation category, Tencent's Hunyuan Image ranked first, followed by ByteDance's Doubao image v2.1 and Ideogram 2.0 [12].

K12 Subject Testing
- In the K12 subject tests, models' overall scores improved by 12.86% compared with six months ago, but they still trail the average performance of Haidian district students, particularly in science subjects [19][20].

Financial Quantitative Trading
- In the financial quantitative-trading evaluation, leading models are approaching the level of junior quantitative researchers, with Deepseek-chat, OpenAI GPT-4o-2024-08-06, and Google Gemini-1.5-pro-latest ranking as the top three [25][27].

Evaluation Methodology
- The FlagEval evaluation system has been iterated to cover more than 800 models worldwide, drawing on over 200 million evaluation questions across a range of tasks [27][28].
- The evaluation process now includes dynamic updates and progressively harder questions to avoid dataset saturation [28][29]; a hypothetical sketch of this rotation idea follows below.
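The report does not describe how FlagEval actually refreshes its question pool. The sketch below is only a minimal illustration of the general idea of retiring near-saturated questions and substituting harder ones; every class name, threshold, and sample item is an assumption for illustration, not part of FlagEval.

```python
"""Hypothetical sketch of evaluation-pool rotation to counter benchmark saturation.
Not FlagEval's implementation; all names, thresholds, and data are assumed."""

import random
from dataclasses import dataclass, field


@dataclass
class Question:
    text: str
    difficulty: int          # 1 (easy) .. 5 (hard), hypothetical scale
    solve_rate: float = 0.0  # fraction of evaluated models answering correctly


@dataclass
class EvalPool:
    questions: list[Question] = field(default_factory=list)
    saturation_threshold: float = 0.9  # assumed cutoff for "too easy"

    def record_results(self, results: dict[str, float]) -> None:
        """Update per-question solve rates from the latest evaluation round."""
        for q in self.questions:
            if q.text in results:
                q.solve_rate = results[q.text]

    def refresh(self, candidate_bank: list[Question]) -> None:
        """Retire near-saturated questions and draw harder replacements."""
        retired = [q for q in self.questions if q.solve_rate >= self.saturation_threshold]
        kept = [q for q in self.questions if q.solve_rate < self.saturation_threshold]
        # Replacements must be at least as hard as the hardest retired item.
        min_difficulty = max((q.difficulty for q in retired), default=1)
        harder = [c for c in candidate_bank if c.difficulty >= min_difficulty]
        replacements = random.sample(harder, k=min(len(retired), len(harder)))
        self.questions = kept + replacements


if __name__ == "__main__":
    pool = EvalPool(questions=[Question("2+2=?", difficulty=1),
                               Question("Prove Fermat's little theorem.", difficulty=4)])
    bank = [Question("Derive the KKT conditions.", difficulty=5),
            Question("3+3=?", difficulty=1)]
    # Suppose nearly all evaluated models solved the easy item; it gets rotated out.
    pool.record_results({"2+2=?": 0.98, "Prove Fermat's little theorem.": 0.41})
    pool.refresh(bank)
    print([q.text for q in pool.questions])
```

Under these assumptions, each evaluation round feeds solve rates back into the pool, and any question most models now answer correctly is swapped for a harder candidate, which is one plausible way a benchmark can keep discriminating between models over time.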