Accuracy cut in half! Large models' visual abilities "break down" the moment they leave everyday life
量子位 · 2025-12-09 01:21
Core Insights

The article discusses the limitations of existing multimodal large language models (MLLMs) in specialized fields such as surgery, industry, extreme sports, and animal perspectives, highlighting the need for a new evaluation benchmark called EgoCross [1][3][9].

Group 1: EgoCross Benchmark
- EgoCross is the first cross-domain egocentric video question-answering benchmark, covering four high-value professional fields and providing nearly a thousand high-quality QA pairs [3][9].
- The benchmark includes both close-ended (CloseQA) and open-ended (OpenQA) question formats, addressing a significant gap in the assessment of MLLMs [3][9]; a minimal sketch of the two formats appears after this summary.

Group 2: Model Evaluation and Findings
- The research team tested eight mainstream MLLMs, revealing significant cross-domain shortcomings: even the best models achieved less than 55% accuracy on CloseQA and under 35% on OpenQA in cross-domain scenarios [4][12].
- Models performed well on everyday activities but dropped sharply when applied to specialized fields, with accuracy falling from 73.58% on daily activities to 43.14% in cross-domain scenarios [12][18]; the size of that drop is worked out in the second sketch below.

Group 3: Task Types and Challenges
- The benchmark assesses four core tasks: identification, localization, prediction, and counting, with 15 sub-tasks designed to evaluate model capabilities comprehensively [11][12].
- Prediction tasks, such as forecasting the next action, degraded far more than basic identification tasks [18].

Group 4: Improvement Strategies
- The research explored three improvement methods: prompt learning, supervised fine-tuning (SFT), and reinforcement learning (RL), with RL showing the largest gain, an average 22% increase in CloseQA accuracy [15][14]; a toy SFT loop is sketched at the end of this summary.
- SFT delivered nearly a 20% performance boost in the industrial domain, indicating the potential of targeted model training [15].

Group 5: Future Directions
- The findings provide valuable insight into the current capabilities and limitations of large models, suggesting directions for developing more generalized multimodal systems [16][17].
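To make the CloseQA/OpenQA distinction concrete, here is a minimal Python sketch of the two question formats and how each is typically scored. The schema, field names, and scoring functions are hypothetical illustrations, not the actual EgoCross data format or evaluation protocol.

```python
# CloseQA: the model picks one of several candidate answers (multiple choice),
# so scoring can be an exact match against the gold option.
# NOTE: this schema is a hypothetical illustration, not the EgoCross format.
close_qa_item = {
    "video": "surgery_clip_001.mp4",  # egocentric video clip (placeholder name)
    "question": "Which instrument is the surgeon holding?",
    "options": ["scalpel", "forceps", "retractor", "suture needle"],
    "answer": "forceps",
}

# OpenQA: the model generates a free-form answer; scoring typically needs a
# judge (human, rule-based, or LLM) rather than exact match.
open_qa_item = {
    "video": "surgery_clip_001.mp4",
    "question": "Which instrument is the surgeon holding?",
    "answer": "forceps",
}

def score_close_qa(prediction: str, item: dict) -> bool:
    """Exact match against the gold option (case-insensitive)."""
    return prediction.strip().lower() == item["answer"].lower()

def score_open_qa(prediction: str, item: dict) -> bool:
    """Crude stand-in for a judge: check whether the gold answer appears
    in the generated text. Real open-ended scoring is more involved."""
    return item["answer"].lower() in prediction.lower()

print(score_close_qa("forceps", close_qa_item))               # True
print(score_open_qa("It looks like forceps.", open_qa_item))  # True
```

The gap between the two scores in the article (under 55% on CloseQA vs. under 35% on OpenQA) reflects this difference: multiple choice gives the model candidate answers to lean on, while open-ended generation does not.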
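The reported domain gap can be quantified directly from the article's two accuracy figures. The short computation below shows the drop is about 30 percentage points, or roughly 41% of the in-domain baseline, broadly in line with the headline's "accuracy cut in half".

```python
# Quantifying the reported domain gap from the article's numbers:
# 73.58% accuracy on everyday (in-domain) activities vs. 43.14% cross-domain.
in_domain = 73.58
cross_domain = 43.14

absolute_drop = in_domain - cross_domain         # in percentage points
relative_drop = absolute_drop / in_domain * 100  # as a share of the baseline

print(f"absolute drop: {absolute_drop:.2f} points")  # 30.44 points
print(f"relative drop: {relative_drop:.1f}%")        # 41.4% of baseline accuracy
```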
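For readers unfamiliar with SFT, the sketch below is a generic, minimal PyTorch supervised fine-tuning loop on toy CloseQA-style data. It illustrates only the general idea of adapting model parameters on in-domain QA pairs; the tiny model, random data, and hyperparameters are placeholders, not the paper's actual setup.

```python
# A generic, minimal supervised fine-tuning (SFT) loop in PyTorch.
# Everything here (model, data, hyperparameters) is a placeholder used
# to illustrate the technique, not the paper's experimental setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

NUM_OPTIONS = 4  # CloseQA-style: pick one of four candidate answers
FEAT_DIM = 32    # stand-in for fused video+question features

# Toy "dataset": random feature vectors paired with gold option indices.
features = torch.randn(64, FEAT_DIM)
labels = torch.randint(0, NUM_OPTIONS, (64,))

# Stand-in for the trainable head of a much larger multimodal model.
head = nn.Sequential(
    nn.Linear(FEAT_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_OPTIONS),
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    optimizer.zero_grad()
    logits = head(features)
    loss = loss_fn(logits, labels)  # supervised signal from gold answers
    loss.backward()
    optimizer.step()

accuracy = (head(features).argmax(dim=1) == labels).float().mean().item()
print(f"toy train accuracy after SFT: {accuracy:.2%}")
```

An RL-based variant would replace the fixed cross-entropy target with a reward computed from the model's own generated answers; per the article, that approach yielded the largest cross-domain gains of the three strategies.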