FutureOmni
Omni-modal audio-visual future forecasting: FutureOmni hands in the first answer sheet
机器之心 · 2026-01-24 01:53
Core Insights
- FutureOmni is the first comprehensive benchmark for omni-modal future forecasting, developed by Fudan University, Shanghai Chuangzhi Academy, and the National University of Singapore; it focuses on predicting future events from audio-visual cues [2][3]
- Current multimodal large language models (MLLMs) struggle with future event prediction, with the best model reaching only 64.8% accuracy [21]

Evaluation Paradigm Shift
- The evaluation paradigm shifts from retrospective understanding to future prediction, requiring models to predict "what will happen next" rather than merely describe "what happened" [5][9]
- Existing benchmarks focus primarily on retrospective analysis and neglect models' predictive capabilities [7][8]

FutureOmni Dataset
- The FutureOmni dataset consists of 919 videos and 1,034 multiple-choice questions covering eight major domains, including education, emergencies, and daily life [18]
- The dataset is 100% original: all videos were collected from first-time sources [18]

Model Performance
- Evaluations of 13 multimodal models and 7 video-only models show that even the best performers struggle with future prediction, leaving a clear gap to human-level performance [21]
- The analysis finds speech scenarios the most challenging for models, while music scenarios are comparatively easier [24] (a minimal scoring sketch appears after this summary)

OFF Training Strategy
- The proposed Omni-Modal Future Forecasting (OFF) strategy aims to strengthen models' predictive capabilities by training them to capture the causal relationships between audio and visual information [30][31]
- Results show that combining audio and video significantly outperforms video alone, underscoring the importance of audio information for future prediction [33]

Future Prospects
- FutureOmni lays a foundation for evaluating the future-forecasting capabilities of multimodal large language models, with hopes for broader model participation and method improvements [41][43]
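To make the evaluation protocol concrete, below is a minimal sketch of how accuracy on a FutureOmni-style multiple-choice set might be scored, with a per-scenario breakdown of the kind behind the speech-versus-music comparison above. The item fields, scenario labels, and the predict_choice stub are hypothetical illustrations, not the benchmark's actual data format or tooling.

```python
from collections import defaultdict

# Hypothetical FutureOmni-style items: each pairs audio-visual context with
# candidate future events, a gold answer, and a scenario tag (e.g. "speech",
# "music"). Field names are illustrative only.
questions = [
    {"id": "q1", "options": ["A", "B", "C", "D"], "answer": "A", "scenario": "speech"},
    {"id": "q2", "options": ["A", "B", "C", "D"], "answer": "D", "scenario": "music"},
]

def predict_choice(question):
    """Placeholder for an MLLM call that returns one option letter.

    In practice this would feed the video frames and audio track to the model
    and parse its chosen option; here it simply returns "A".
    """
    return "A"

def evaluate(questions):
    """Compute overall and per-scenario multiple-choice accuracy."""
    correct = 0
    per_scenario = defaultdict(lambda: [0, 0])  # scenario -> [correct, total]
    for q in questions:
        hit = predict_choice(q) == q["answer"]
        correct += hit
        per_scenario[q["scenario"]][0] += hit
        per_scenario[q["scenario"]][1] += 1
    overall = correct / len(questions)
    breakdown = {s: c / t for s, (c, t) in per_scenario.items()}
    return overall, breakdown

if __name__ == "__main__":
    overall, breakdown = evaluate(questions)
    print(f"overall accuracy: {overall:.1%}")
    for scenario, acc in breakdown.items():
        print(f"  {scenario}: {acc:.1%}")
```

Under this sketch, the reported 64.8% best accuracy would correspond to the overall figure, while the speech and music entries in the breakdown mirror the per-scenario analysis described above.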