Yiheng (弈衡): Multimodal Large Model Evaluation System Whitepaper
China Mobile · 2024-10-12 10:03

Investment Rating

- The report does not explicitly provide an investment rating for the industry.

Core Insights

- The rapid development of artificial intelligence, particularly since the introduction of the Transformer model in 2017, has led to the emergence of multimodal large models that can process text, images, and audio, showing significant application potential across sectors [3][4].
- The report emphasizes the need for a comprehensive, objective evaluation system that can assess the performance of multimodal large models in specific task scenarios, which is crucial to their development and practical application [4][5].
- The "Yiheng" evaluation system is proposed to address the challenges of evaluating multimodal large models; it takes a user-centered perspective and aims to build a standardized evaluation ecosystem for the industry [5][36].

Summary by Sections

1. Background of Multimodal Large Model Evaluation

- The development of multimodal large models has enhanced their ability to process diverse data types, leading to widespread adoption across industries such as content creation, education, finance, healthcare, and smart manufacturing [6][7].
- The report identifies the need for an evaluation framework that can objectively assess the performance of these models across application scenarios, highlighting the challenges posed by the complexity and diversity of evaluation tasks [9][12].

2. Evaluation Technology for Multimodal Large Models

- Evaluation methods combine objective assessments based on quantitative metrics with subjective evaluations based on human scoring, the latter used particularly for creative tasks [18][19].
- Key evaluation dimensions include model performance, generalization ability, robustness, and consistency, with common metrics such as accuracy, F1 score, and BLEU [21][22].

3. Typical Evaluation Systems for Multimodal Large Models

- Research institutions and companies have developed a range of evaluation systems, such as MMBench, OCRBench, and SEED-Bench, each focusing on different aspects of model performance and different application scenarios [24][25][26].
- These systems aim to provide comprehensive assessments of multimodal large models, addressing both subjective and objective evaluation needs [28][30].

4. "Yiheng" Multimodal Large Model Evaluation System

- The "Yiheng" evaluation system is structured around a "2-4-6" framework: two types of evaluation scenarios, four evaluation elements, and six evaluation dimensions, covering functionality, performance, reliability, safety, and interactivity [35][36].
- Evaluation tasks are divided into basic tasks and application tasks, with specific examples provided for each category, ensuring a thorough assessment of the models' capabilities in real-world applications [39][41].
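The objective metrics named in Section 2 (accuracy, F1) are standard classification measures. As a minimal illustration, not code from the whitepaper, they can be computed for a binary task as follows; corpus-level BLEU is more involved and is omitted here:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # → 0.6
print(f1_score(y_true, y_pred))  # → 0.666... (precision 2/3, recall 2/3)
```

In practice an evaluation harness would delegate to a library such as scikit-learn rather than hand-rolling these, but the definitions above are what those libraries compute.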
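The "2-4-6" framework in Section 4 can be pictured as a simple configuration object. This is a hypothetical sketch, not a structure from the whitepaper: the field names are assumptions, the four evaluation elements are not individually named in this summary, and only five of the six dimensions are enumerated here.

```python
# Hypothetical representation of Yiheng's "2-4-6" evaluation framework.
# Field names are illustrative; this summary enumerates five of the
# six dimensions and does not name the four elements individually.
YIHENG_FRAMEWORK = {
    "scenario_types": ["basic_tasks", "application_tasks"],  # the "2"
    "evaluation_elements": 4,                                # the "4"
    "dimensions": [                                          # the "6"
        "functionality",
        "performance",
        "reliability",
        "safety",
        "interactivity",
        # sixth dimension not named in this summary
    ],
}
```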