Core Viewpoint
- The development of artificial intelligence (AI) is entering a new phase in which the focus shifts from solving problems to defining them, placing evaluation standards above training techniques in importance [2][3].

Group 1: Evaluation Framework
- A new evaluation framework called "General-Level" has been proposed to assess the capabilities of multimodal large language models (MLLMs), aiming to measure their progress toward artificial general intelligence (AGI) [3][6].
- The General-Level framework categorizes MLLMs into five levels based on their ability to exhibit synergy across different tasks and modalities, with the highest level representing true multimodal intelligence [11][15] (a toy sketch of this leveling idea appears after the summary).
- The framework highlights the need for a unified standard for evaluating "generalist intelligence," addressing the current fragmentation of assessment methods [6][9].

Group 2: General-Bench Testing Set
- General-Bench is a comprehensive multimodal test set of 700 tasks and approximately 325,800 questions, designed to rigorously evaluate MLLMs across modalities [19][21] (a small per-modality aggregation sketch also follows the summary).
- The test set emphasizes open-ended responses and content generation, moving beyond traditional multiple-choice formats to assess models' creative capabilities [24][25].
- Its design includes cross-modal tasks that require models to integrate information from different modalities, simulating real-world challenges [24][25].

Group 3: Model Performance Insights
- Initial test results reveal that many leading models, including GPT-4V, exhibit significant weaknesses, particularly on video and audio tasks, indicating a lack of comprehensive multimodal capability [23][25].
- Roughly 90% of the tested models reached only Level-2 (Silver) in the General-Level framework, demonstrating limited synergy and generalization across tasks [27][28].
- No model has yet achieved Level-5 (King) status, highlighting the ongoing challenge of achieving true multimodal intelligence and the need for further advances [28][29].

Group 4: Community Response and Future Outlook
- The introduction of General-Level and General-Bench has drawn positive feedback from both the academic and industrial communities, including recognition at major conferences [35][36].
- The project's open-source nature encourages collaboration and continuous improvement of the evaluation framework, fostering a community-driven approach to AI assessment [36][39].
- The new evaluation paradigm is expected to accelerate progress toward AGI by providing clear benchmarks and encouraging a focus on comprehensive model capabilities rather than isolated performance metrics [41][42].
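This summary does not reproduce General-Level's actual scoring formulas, so the following is only a minimal Python sketch of the synergy idea behind the five levels: a generalist is ranked by whether it matches or beats specialist state-of-the-art (SoTA) on individual tasks, and by how many modalities those wins span. Every name, threshold, and tier mapping here (classify_level, specialist_sota, the win-rate cutoffs) is an illustrative assumption, not the paper's definition.

```python
def classify_level(generalist: dict[str, float],
                   specialist_sota: dict[str, float],
                   task_modality: dict[str, str]) -> int:
    """Toy synergy-based leveler (illustrative assumptions only).

    generalist / specialist_sota map task name -> score on that task;
    task_modality maps task name -> modality ("image", "video", "audio", ...).
    Higher levels loosely require beating specialist SoTA on more tasks
    across more modalities, mirroring the framework's synergy notion.
    """
    tasks = generalist.keys() & specialist_sota.keys()
    if not tasks:
        return 1  # nothing to compare against: treat as a Level-1 specialist

    # Tasks where the generalist matches or beats the specialist SoTA.
    wins = [t for t in tasks if generalist[t] >= specialist_sota[t]]
    win_rate = len(wins) / len(tasks)
    modalities_won = {task_modality[t] for t in wins}
    modalities_covered = {task_modality[t] for t in tasks}

    if win_rate == 1.0 and len(modalities_won) >= 4:
        return 5  # total synergy everywhere: Level-5 ("King")
    if win_rate >= 0.5 and len(modalities_won) >= 3:
        return 4  # broad cross-modal synergy (assumed here to be the "Platinum" tier)
    if win_rate >= 0.25 and len(modalities_won) >= 2:
        return 3  # partial synergy spanning at least two modalities
    if len(modalities_covered) >= 2:
        return 2  # covers many tasks/modalities but shows no synergy ("Silver")
    return 1


# Hypothetical scores on three tasks:
print(classify_level(
    generalist={"caption": 0.81, "vqa": 0.73, "asr": 0.44},
    specialist_sota={"caption": 0.78, "vqa": 0.80, "asr": 0.62},
    task_modality={"caption": "image", "vqa": "image", "asr": "audio"},
))  # -> 2: it wins only on "caption", so no cross-modal synergy
```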
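As a companion sketch, here is how per-modality results from a General-Bench-style run might be aggregated to surface the kind of video/audio weakness noted in Group 3. The result records and field layout are assumptions for illustration, not the benchmark's released data format.

```python
from collections import defaultdict

# Hypothetical per-task results: (task, modality, score in [0, 1]).
results = [
    ("image_captioning", "image", 0.74),
    ("text_to_image", "image", 0.66),
    ("video_qa", "video", 0.41),
    ("speech_recognition", "audio", 0.38),
]

# Group scores by modality, then report a mean per modality.
by_modality: dict[str, list[float]] = defaultdict(list)
for task, modality, score in results:
    by_modality[modality].append(score)

for modality, scores in sorted(by_modality.items()):
    print(f"{modality}: mean={sum(scores) / len(scores):.2f} over {len(scores)} task(s)")
```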
Over 90% of models stop at the Silver tier, and only 3 reach Platinum! The evaluation standard for the second half of general AI has arrived
机器之心·2025-05-21 00:33