GPT-4V Only Reaches Level-2? The World's First Multimodal Generalist Tier Ranking Is Released: General-Level Creates a New Paradigm for Evaluating General-Purpose Multimodal AI
量子位· 2025-05-16 01:24
Core Insights
- The article discusses the rapid rise of Multimodal Large Language Models (MLLMs), which can understand and generate multiple modalities such as images, text, audio, and video. It emphasizes the need for a scientific evaluation mechanism to assess these models as the AI competition evolves [1][2].

Evaluation Framework
- The General-Level evaluation framework introduces a five-tier ranking system to measure the generalist capabilities of multimodal models, centered on the synergy effect, where knowledge transfers between different tasks and modalities [3][12].
- The five levels are:
  - Level-1: Specialist models fine-tuned for specific tasks [6].
  - Level-2: Generalists that support multiple modalities but show no synergy [11].
  - Level-3: Task-level synergy, where the model outperforms specialist models on certain tasks [11].
  - Level-4: Paradigm-level synergy, indicating integrated reasoning across understanding and generation tasks [7].
  - Level-5: Total synergy across all modalities, representing the ultimate goal of AGI, which no model has yet reached [9][72].

General-Bench Evaluation Benchmark
- General-Bench is described as the largest and most comprehensive evaluation benchmark for multimodal AI, covering over 700 tasks and 325,000+ samples across five core modalities: image, video, audio, 3D, and language [14][17].
- It spans a wide range of tasks, from traditional understanding tasks to generative tasks, allowing free-form responses and objective assessment [15][18].

Leaderboard Design
- The Leaderboard system presents evaluation results transparently, featuring a multi-tiered scope mechanism that lets models of varying capabilities compete in different categories [19][20].
- Scope-A is the main leaderboard for "full-modal generalists," while Scope-B, Scope-C, and Scope-D focus on specific modalities, understanding/generation tasks, and detailed skill categories, respectively [22][24][27][29].
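The multi-tiered scope mechanism can be pictured as an eligibility rule: which leaderboards a model may enter, given the modalities and task paradigms it supports. The scope names come from the article; the eligibility logic and modality set below are illustrative assumptions, not the project's actual rules.

```python
# Hedged sketch of leaderboard scope eligibility. Scope-A/B/C/D are named in
# the article; the rules and the five-modality set are assumptions.

ALL_MODALITIES = {"image", "video", "audio", "3D", "language"}


def eligible_scopes(modalities: set, paradigms: set) -> list:
    """Return the leaderboard scopes a model can enter.

    Scope-A: full-modal generalists only; Scope-B: per-modality boards;
    Scope-C: understanding- or generation-only boards; Scope-D: detailed
    skill-category boards, assumed open to every model.
    """
    scopes = []
    if modalities == ALL_MODALITIES:
        scopes.append("Scope-A")  # only full-modal generalists qualify
    scopes += [f"Scope-B:{m}" for m in sorted(modalities)]
    scopes += [f"Scope-C:{p}" for p in sorted(paradigms)]
    scopes.append("Scope-D")
    return scopes
```

For example, an image-only understanding model would, under these assumptions, compete only in Scope-B (image), Scope-C (understanding), and Scope-D, never in the main Scope-A board.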
Current Leaderboard Status
- The Leaderboard currently includes over 100 multimodal models, with a significant share classified as Level-2: they support a wide range of tasks but lack synergy [56][61].
- Level-3 models demonstrate task-level synergy, outperforming specialist models on certain benchmarks, while Level-4 models are rare and show promise in cross-paradigm reasoning [65][69].
- No model has yet achieved Level-5, highlighting the difficulty of reaching comprehensive multimodal synergy [72][75].

Community Engagement
- The General-Level project encourages community participation, allowing researchers to submit models and contribute to the benchmark's task diversity, fostering an open, collaborative environment for advancing multimodal AI [77].
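The level assignments reported above (a Level-2 majority, rare Level-4 models, no Level-5) follow from the tier definitions: synergy is judged by whether the generalist beats specialist baselines, and at which granularity. The article does not give the exact scoring formula, so the classification rule below is a minimal sketch under assumed thresholds and field names.

```python
# Hedged sketch of General-Level tier assignment. The comparison against
# specialist SOTA as the synergy test matches the article's description;
# the exact rule and data fields are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class TaskResult:
    modality: str            # e.g. "image", "audio"
    paradigm: str            # "understanding" or "generation"
    model_score: float       # generalist model's score on this task
    specialist_score: float  # best specialist (SOTA) score on the same task


def general_level(results: list) -> int:
    """Assign a Level-1..Level-5 tier from per-task results."""
    modalities = {r.modality for r in results}
    if len(modalities) <= 1:
        return 1  # Level-1: effectively a single-modality specialist
    wins = [r for r in results if r.model_score > r.specialist_score]
    if not wins:
        return 2  # Level-2: broad task support, but no synergy
    win_modalities = {r.modality for r in wins}
    win_paradigms = {r.paradigm for r in wins}
    both_paradigms = win_paradigms == {"understanding", "generation"}
    if both_paradigms and win_modalities == modalities:
        return 5  # Level-5: synergy across all modalities and paradigms
    if both_paradigms:
        return 4  # Level-4: paradigm-level synergy
    return 3      # Level-3: task-level synergy on some tasks
```

Under this rule, a model that beats specialists on a few understanding tasks lands at Level-3, one that does so across both understanding and generation reaches Level-4, and Level-5 requires wins across every supported modality and both paradigms, which matches the article's claim that no model has reached it.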