MMAR Benchmark

AI collectively "can't understand"! The MMAR benchmark exposes a major weakness in large audio models
量子位 · 2025-06-09 05:24
Core Viewpoint
- The MMAR benchmark reveals that most AI models struggle significantly with complex audio reasoning tasks, indicating a gap in their practical applicability in real-world scenarios [1][9][18].

Summary by Sections

MMAR Benchmark Overview
- MMAR stands for "A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix," consisting of 1000 high-quality audio understanding questions that require multi-step reasoning capabilities [2][3].

Difficulty of MMAR
- The benchmark includes questions that assess reasoning at multiple levels, including signal, perception, semantic, and cultural understanding, with tasks requiring complex reasoning skills and domain-specific knowledge [6][9].

Model Performance
- A total of 30 audio-related models were tested. The best open-source model, Qwen-2.5-Omni, achieved an average accuracy of only 56.7%, while the closed-source Gemini 2.0 Flash led with 65.6% [11][18].
- Most open-source models performed close to random guessing, particularly on music-related tasks, highlighting significant difficulty in recognizing deeper audio information [12][18].

Error Analysis
- The primary error types identified were perceptual errors (37%), reasoning errors (20%), knowledge gaps (9%), and other errors (34%), indicating that current AI models face both auditory and cognitive challenges [19].

Future Outlook
- The research emphasizes the need for collaborative innovation in data and algorithms to improve AI audio reasoning, in the hope of future models that can truly understand audio content and context [20][21].
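The accuracy figures above are most meaningful against a random-guess baseline: on a multiple-choice benchmark, a model near the baseline has learned essentially nothing about the task. Below is a minimal, hypothetical sketch of that comparison; MMAR's actual data format and scoring script are not specified here, so the question encoding and option count are assumptions for illustration.

```python
import random

def accuracy(predictions, answers):
    """Fraction of questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def random_baseline(n_options=4):
    """Expected accuracy of uniform random guessing (assumed option count)."""
    return 1.0 / n_options

# Toy illustration: 1000 questions (as in MMAR), 4 options each (assumed).
random.seed(0)
answers = [random.randrange(4) for _ in range(1000)]
guesses = [random.randrange(4) for _ in range(1000)]
print(f"random-guess accuracy ~= {accuracy(guesses, answers):.3f} "
      f"(expected {random_baseline():.2f})")
```

Against a 25% four-option baseline, the gap between 56.7% (best open-source) and 65.6% (best closed-source) is modest, while models scoring near 25% are effectively guessing.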