Prompt Learning
Accuracy Cut in Half: Large Models' Visual Abilities "Fail" the Moment They Leave Everyday Life
36Kr · 2025-12-09 06:59
Core Insights
- The EgoCross project evaluates cross-domain first-person (egocentric) video question answering, revealing the limitations of existing MLLMs in specialized fields such as surgery, industry, extreme sports, and animal perspectives [1][3][4]

Group 1: Project Overview
- EgoCross is the first cross-domain EgocentricQA benchmark, covering four high-value professional fields and containing nearly 1,000 high-quality QA pairs [3][9]
- The benchmark provides both closed (CloseQA) and open (OpenQA) evaluation formats, addressing a significant gap in assessing models in these specialized areas [3][9]

Group 2: Model Evaluation
- Eight mainstream MLLMs were tested; even the best-performing models scored below 55% CloseQA accuracy and below 35% OpenQA accuracy in cross-domain scenarios [4][9] (a minimal sketch of CloseQA-style scoring follows this summary)
- Reinforcement learning (RL) methods significantly improved performance, with an average accuracy increase of 22% [10][16]

Group 3: Task and Domain Challenges
- The research highlights a significant domain shift between everyday activities and specialized fields: models perform well on daily tasks but struggle in professional contexts [8][9]
- Prediction tasks showed a more severe performance decline than basic identification tasks [13][16]

Group 4: Improvement Strategies
- Three improvement methods were explored: prompt learning, supervised fine-tuning (SFT), and reinforcement learning (RL), with RL yielding the most substantial gains [15][16] (a toy prompt-learning sketch is also included after this summary)
- The findings point to limited generalization in current models, indicating a need for further work toward more capable multimodal systems [16]
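The article does not show how the CloseQA numbers are computed; the following is a minimal sketch assuming a standard multiple-choice exact-match protocol. The `QAPair` fields, the A-D option format, and the `ask` callable standing in for an MLLM are illustrative assumptions, not part of EgoCross itself.

```python
"""Minimal sketch of exact-match scoring for CloseQA-style items.

Assumptions (not from the article): the QAPair fields, the four-option
A-D format, and the `ask` callable that stands in for an MLLM.
"""
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class QAPair:
    question: str                      # question about a first-person video clip
    options: Optional[Sequence[str]]   # answer choices for CloseQA; None for OpenQA
    answer: str                        # gold option letter, e.g. "B"


def closeqa_accuracy(items: Sequence[QAPair], ask: Callable[[str], str]) -> float:
    """Fraction of items where the model's first A-D letter matches the gold letter."""
    correct = 0
    for item in items:
        choices = "\n".join(
            f"{chr(65 + i)}. {opt}" for i, opt in enumerate(item.options or [])
        )
        prompt = f"{item.question}\n{choices}\nAnswer with the option letter only."
        reply = ask(prompt).strip().upper()
        pred = next((ch for ch in reply if ch in "ABCD"), "")
        correct += int(pred == item.answer.upper())
    return correct / max(len(items), 1)


if __name__ == "__main__":
    # Dummy "model" that always answers A, just to exercise the harness.
    data = [
        QAPair("Which instrument is the surgeon holding?",
               ["Scalpel", "Forceps", "Clamp", "Retractor"], "A"),
        QAPair("What will the climber most likely do next?",
               ["Rest", "Jump", "Clip the rope", "Descend"], "C"),
    ]
    print(closeqa_accuracy(data, lambda prompt: "A"))  # -> 0.5
```

OpenQA items have no options, so they would instead be judged against the reference answer (for example by a human or an LLM judge), which is why their reported accuracy is lower and not directly comparable.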
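The article also does not detail what "prompt learning" involves in this study; the sketch below assumes the simplest reading, prepending a hand-written domain hint to each question before querying the model. The `DOMAIN_PREFIXES` text and the `ask_with_domain_prompt` helper are hypothetical.

```python
"""Toy prompt-learning baseline: condition the question on a domain hint.

The prefixes and the `ask` callable are illustrative assumptions, not the
paper's actual prompts or API.
"""
from typing import Callable

DOMAIN_PREFIXES = {
    "surgery": "You are watching first-person footage from a surgical procedure.",
    "industry": "You are watching first-person footage from an industrial workstation.",
    "extreme_sports": "You are watching first-person footage from an extreme-sports athlete.",
    "animal": "You are watching footage captured from an animal's point of view.",
}


def ask_with_domain_prompt(question: str, domain: str, ask: Callable[[str], str]) -> str:
    """Send the question through a domain-conditioned prompt instead of the bare question."""
    prefix = DOMAIN_PREFIXES.get(domain, "")
    return ask(f"{prefix}\n{question}".strip())


if __name__ == "__main__":
    echo = lambda p: p  # stand-in model that just echoes its prompt
    print(ask_with_domain_prompt("Which instrument is being used?", "surgery", echo))
```

Unlike SFT or RL, this kind of prompting changes no model weights, which is consistent with the article's finding that RL delivered the largest gains.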