ICLR 2026 | Do multimodal large models really understand emotion? MME-Emotion gives a systematic answer
机器之心 · 2026-03-15 01:20

Core Viewpoint
- Multimodal Large Language Models (MLLMs) are rapidly transforming AI capabilities, particularly in understanding human emotion across modalities [2][3].

Group 1: MME-Emotion Benchmark
- MME-Emotion is a comprehensive benchmark for evaluating emotional intelligence in MLLMs, developed by a team from The Chinese University of Hong Kong and Alibaba's Tongyi Lab, and accepted at ICLR 2026 [3].
- It is among the largest multimodal emotional-intelligence benchmarks to date, containing roughly 6,500 video clips with corresponding Q&A data, covering 27 real-world scenarios and spanning 8 distinct emotional tasks [5].
- The benchmark emphasizes the integration of multimodal information in real environments, requiring models to understand visual, auditory, and linguistic information simultaneously [5].

Group 2: Evaluation Tasks and Metrics
- The eight tasks are: laboratory emotion recognition, real-world emotion recognition, emotion recognition under noisy conditions, fine-grained emotion recognition, multi-label emotion recognition, sentiment analysis, fine-grained sentiment analysis, and intent recognition [8].
- MME-Emotion evaluates both recognition and reasoning, distinguishing a model that merely guesses the correct emotion label from one that genuinely understands the underlying emotional cues [8].
- A unified metric system is proposed, comprising a Recognition Score, a Reasoning Score, and a Chain-of-Thought Score, which assess the accuracy of emotion predictions, the soundness of the reasoning process, and overall performance, respectively [10]; a toy sketch of such a scoring pipeline appears after Group 4 below.

Group 3: Model Performance and Challenges
- An evaluation of 20 mainstream multimodal models found that even the best performers scored below 40% on emotion recognition and around 56% on the Chain-of-Thought Score, revealing significant gaps in emotional intelligence [13].
- Key issues identified include insufficient fine-grained visual understanding, limited multimodal information fusion, and a correlation between reasoning ability and emotion-recognition performance [14][15][16].
- The findings suggest that strengthening models' reasoning processes may be a crucial pathway to improving emotional intelligence [16].

Group 4: Future Directions
- Progress in multimodal emotional intelligence will likely depend on finer-grained visual detail modeling, more effective fusion of auditory and visual information, and reasoning mechanisms that can explain the causes of emotions [16].
- The release of MME-Emotion provides a unified evaluation standard and a clear reference baseline for subsequent model improvements [17].
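To make the three-part metric system from Group 2 concrete, here is a minimal Python sketch of how such a scoring pipeline could work. Everything in it is an assumption for illustration: the `Sample` fields, the keyword-overlap stand-in for judging rationales, and the equal weighting inside `cot_score` are hypothetical and do not reproduce the paper's actual scoring protocol, which this summary does not specify.

```python
from dataclasses import dataclass

# Hypothetical per-sample record; MME-Emotion's real schema is not given here.
@dataclass
class Sample:
    gold_label: str       # ground-truth emotion, e.g. "anger"
    gold_cues: set[str]   # annotated evidence cues (facial, vocal, textual)
    pred_label: str       # model's predicted emotion label
    pred_rationale: str   # model's free-text chain-of-thought

def recognition_score(samples: list[Sample]) -> float:
    """Fraction of samples whose predicted label matches the gold label."""
    hits = sum(s.pred_label.lower() == s.gold_label.lower() for s in samples)
    return hits / len(samples)

def reasoning_score(samples: list[Sample]) -> float:
    """Crude stand-in for a judge model: fraction of annotated gold cues
    that the rationale mentions, averaged over samples."""
    total = 0.0
    for s in samples:
        text = s.pred_rationale.lower()
        covered = sum(cue.lower() in text for cue in s.gold_cues)
        total += covered / len(s.gold_cues)
    return total / len(samples)

def cot_score(samples: list[Sample]) -> float:
    """Combined score; the 50/50 weighting is an assumption, not the paper's formula."""
    return 0.5 * recognition_score(samples) + 0.5 * reasoning_score(samples)

if __name__ == "__main__":
    demo = [
        Sample("anger", {"furrowed brow", "raised voice"},
               "anger", "The furrowed brow and raised voice suggest anger."),
        Sample("sadness", {"tears", "slumped posture"},
               "fear", "The person looks tense and avoids eye contact."),
    ]
    print(f"recognition: {recognition_score(demo):.2f}")  # 0.50
    print(f"reasoning:   {reasoning_score(demo):.2f}")    # 0.50
    print(f"CoT:         {cot_score(demo):.2f}")          # 0.50
```

In practice, benchmarks of this kind often replace the keyword-overlap heuristic with a judge model that rates each rationale; the point of the sketch is only that recognition accuracy and reasoning quality are scored separately and then combined, which is what lets the benchmark separate lucky label guesses from grounded emotional reasoning.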