Group 1: Background and Development
- Since the introduction of the Transformer architecture in 2017, rapid progress in artificial intelligence has made multimodal large models a focal point, with significant advances seen in models such as GPT-4[2]
- Multimodal large models can process diverse data types, including text, images, and audio, showing their potential in applications such as video analysis and multi-target recognition[2]
- An objective, scientific evaluation system for these models is critical to their development and application in real-world scenarios[2]

Group 2: Evaluation Challenges
- Evaluating multimodal large models faces challenges such as diverse evaluation data, complex tasks, and high costs, necessitating a comprehensive evaluation framework[3]
- The high complexity of these models requires careful selection of evaluation tasks that accurately reflect their capabilities without exceeding their limits[11]
- The subjective nature of some evaluation tasks, particularly for creative outputs, demands a standardized assessment framework to ensure fairness and consistency[14]

Group 3: Evaluation Framework
- The "Yiheng" evaluation system is proposed, featuring a "2-4-6" structure: 2 evaluation scenarios, 4 evaluation elements, and 6 evaluation dimensions[33]
- Key evaluation dimensions include functionality, accuracy, reliability, safety, and interactivity, supporting a comprehensive assessment of model capabilities[33]
- The evaluation framework emphasizes user perspectives, aiming to align model performance with real-world application needs[32]
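The "2-4-6" structure above could be modeled in code roughly as follows. This is a minimal illustrative sketch, not the white paper's actual implementation: all names are assumptions, the two scenarios are not named in this summary, and only five of the six evaluation dimensions are listed here, so the sketch uses only those five.

```python
from dataclasses import dataclass

# Hypothetical modeling of the "2-4-6" structure; names are illustrative.
# The summary does not name the 2 scenarios, so placeholders are used.
SCENARIOS = ("scenario_1", "scenario_2")

# Five of the six evaluation dimensions named in the summary.
DIMENSIONS = (
    "functionality",
    "accuracy",
    "reliability",
    "safety",
    "interactivity",
)

@dataclass
class EvaluationResult:
    """Per-dimension scores (0.0 to 1.0) for one model in one scenario."""
    model: str
    scenario: str
    scores: dict  # dimension name -> score

    def overall(self) -> float:
        """Unweighted mean over the dimensions that were actually scored."""
        vals = [self.scores[d] for d in DIMENSIONS if d in self.scores]
        return sum(vals) / len(vals) if vals else 0.0

result = EvaluationResult(
    model="example-model",
    scenario=SCENARIOS[0],
    scores={"functionality": 0.9, "accuracy": 0.8, "safety": 1.0},
)
print(round(result.overall(), 2))  # mean of the three scored dimensions
```

A real scoring scheme would likely weight dimensions differently per scenario and task; the unweighted mean here is only a placeholder for aggregation.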
"Yiheng" (弈衡) Multimodal Large Model Evaluation System White Paper (2024)
China Mobile Research Institute · 2024-10-12 09:01