Core Viewpoint
- The article discusses the limitations of current multimodal models in understanding sound in egocentric videos, emphasizing that models must not only "see" but also "hear" and comprehend the context of sounds in real-world scenarios [1][2][3].

Group 1: Introduction to EgoSound
- EgoSound is introduced as the first systematic benchmark for evaluating sound understanding in egocentric videos, developed by a research team from multiple universities [5][6].
- The goal of EgoSound is to enable models to hear, understand, reason about, and explain events occurring in the real world [6][7].

Group 2: Benchmark Contributions
- EgoSound integrates two complementary datasets: Ego4D, which covers a wide range of daily first-person activities, and EgoBlind, which focuses on scenarios that rely heavily on auditory understanding [9].
- The benchmark comprises seven task categories covering the complete chain from perception to reasoning, addressing the limitation that previous models focused primarily on visual information [10].
- A high-quality, large-scale OpenQA dataset was created, comprising 900 carefully selected videos and 7,315 validated open-ended questions, with an emphasis on questions that hinge on auditory cues [11][12].

Group 3: Model Evaluation and Findings
- The research team evaluated several state-of-the-art (SOTA) multimodal large language models (MLLMs) and provided a systematic analysis to guide future research [13]; a sketch of what such a per-task evaluation might look like follows this summary.
- The evaluation revealed a significant gap between human performance (83.9% accuracy) and the best-performing model (56.7% accuracy), indicating that current models cannot yet reliably convert sound into meaningful cognition [17][18].
- Spatial, temporal, and causal reasoning proved the most challenging aspects for models, which often fail to answer where a sound comes from, when it occurs, and why it happens [20].

Group 4: Challenges in Sound Reasoning
- Cross-modal alignment remains a bottleneck: sound cues frequently originate outside the visual frame, requiring a chain of reasoning that connects hearing, seeing, and inferring [21].
- The complexity of real-world interactions, including occlusions, camera shake, and varying distances to sound sources, has been underestimated, making sound reasoning harder than it appears [22].

Group 5: Conclusion
- The article concludes that while previous multimodal models acted as "visual narrators," EgoSound aims to turn them into true first-person agents that both see and hear, able to describe, locate, explain, and infer in a non-silent real world [23].
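To make the evaluation protocol in Group 3 concrete, here is a minimal Python sketch of how a benchmark like EgoSound might be scored per task category. Everything in it is an assumption for illustration: the item schema (`video_id`, `task`, `question`, `reference_answer`), the `model_answer` stub, and the toy string-overlap judge are all hypothetical, since the article does not specify the dataset format or judging procedure (open-ended QA is typically graded by human raters or an LLM judge rather than string matching).

```python
from dataclasses import dataclass

# Hypothetical item schema for an open-ended audio-visual QA benchmark.
# Field names are illustrative; EgoSound's real format may differ.
@dataclass
class QAItem:
    video_id: str          # source clip (e.g., drawn from Ego4D or EgoBlind)
    task: str              # one of the seven perception-to-reasoning categories
    question: str          # a question that hinges on an auditory cue
    reference_answer: str  # human-validated answer

def model_answer(item: QAItem) -> str:
    """Stub standing in for an MLLM call; replace with a real model."""
    return "a door closing"

def judge(prediction: str, reference: str) -> bool:
    """Toy correctness check; real open-ended grading is far stricter."""
    return reference.lower() in prediction.lower()

def evaluate(items: list[QAItem]) -> dict[str, float]:
    """Per-task accuracy, the kind of breakdown that exposes weak
    spatial, temporal, and causal reasoning."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for item in items:
        total[item.task] = total.get(item.task, 0) + 1
        if judge(model_answer(item), item.reference_answer):
            correct[item.task] = correct.get(item.task, 0) + 1
    return {t: correct.get(t, 0) / n for t, n in total.items()}

if __name__ == "__main__":
    demo = [
        QAItem("ego4d_0001", "causal",
               "What caused the thud behind the camera wearer?", "a door closing"),
        QAItem("egoblind_0042", "spatial",
               "Which side did the car horn come from?", "the left"),
    ]
    print(evaluate(demo))  # e.g., {'causal': 1.0, 'spatial': 0.0}
```

With real model outputs in place of the stub, the aggregate accuracy from such a loop is the figure the article compares against the 83.9% human baseline and the 56.7% best-model result.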
Source: "Fudan and others release a 'first-person audio-visual benchmark,' filling in the 'auditory puzzle piece' of multimodal models"
量子位 (QbitAI) · 2026-03-12 02:59