Supervised Fine-Tuning (SFT)
Accuracy Halved: Large Models' Visual Abilities "Fail" Once Outside Everyday Life
36Kr · 2025-12-09 06:59
Core Insights
- The EgoCross project evaluates cross-domain first-person (egocentric) video question answering, revealing the limitations of existing MLLMs in specialized fields such as surgery, industry, extreme sports, and animal perspectives [1][3][4]

Group 1: Project Overview
- EgoCross is the first cross-domain EgocentricQA benchmark, covering four high-value professional fields and containing nearly 1,000 high-quality QA pairs [3][9]
- The project provides both closed (CloseQA) and open (OpenQA) evaluation formats, addressing a significant gap in the assessment of models in these specialized areas [3][9]

Group 2: Model Evaluation
- Eight mainstream MLLMs were tested; even the best-performing models scored below 55% CloseQA accuracy and below 35% OpenQA accuracy in cross-domain scenarios [4][9]
- Reinforcement learning (RL) methods significantly improved performance, with an average accuracy gain of 22% [10][16]

Group 3: Task and Domain Challenges
- The research highlights the significant domain shift between everyday activities and specialized fields: models perform well on daily tasks but struggle in professional contexts [8][9]
- Prediction tasks showed a more severe performance decline than basic identification tasks [13][16]

Group 4: Improvement Strategies
- Three improvement methods were explored: prompt learning, supervised fine-tuning (SFT), and reinforcement learning (RL), with RL yielding the most substantial performance gains [15][16]
- The findings suggest that current models have limited generalization, indicating a need for further development toward more capable multimodal systems [16]
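The per-domain accuracy gap described above can be made concrete with a small scoring harness. This is a hedged sketch, not the real EgoCross evaluation code: the field names (`domain`, `choices`, `answer`) and the toy data are illustrative assumptions, and `predict` stands in for any MLLM's answer function.

```python
# Illustrative sketch: per-domain CloseQA accuracy, in the spirit of a
# cross-domain benchmark like EgoCross. Schema and data are hypothetical.
from collections import defaultdict

def closeqa_accuracy_by_domain(qa_pairs, predict):
    """qa_pairs: iterable of dicts with 'domain', 'question', 'choices',
    'answer'; predict: fn(question, choices) -> chosen answer."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qa in qa_pairs:
        total[qa["domain"]] += 1
        if predict(qa["question"], qa["choices"]) == qa["answer"]:
            correct[qa["domain"]] += 1
    return {d: correct[d] / total[d] for d in total}

# Toy example: a model that always answers "A" looks fine on the "daily"
# split but drops to chance on the specialized "surgery" split.
pairs = [
    {"domain": "surgery", "question": "q1", "choices": ["A", "B"], "answer": "A"},
    {"domain": "surgery", "question": "q2", "choices": ["A", "B"], "answer": "B"},
    {"domain": "daily",   "question": "q3", "choices": ["A", "B"], "answer": "A"},
]
acc = closeqa_accuracy_by_domain(pairs, lambda q, c: "A")
# acc["daily"] == 1.0, acc["surgery"] == 0.5
```

Splitting accuracy by domain rather than reporting one aggregate number is exactly what exposes the "accuracy halved outside daily life" pattern the article reports.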
One Trick to Mitigate LLM "Lopsidedness": Adjust the Training-Set Composition, and the "Secret Recipe" Is Here | SJTU & Shanghai AI Lab et al.
QbitAI (量子位) · 2025-06-10 07:35
Contributed by the IDEAL team · QbitAI (公众号 QbitAI)

Substantially mitigating LLM "lopsidedness" only requires adjusting the composition of the SFT training set. Llama 3.1-8B, originally weak at coding, showed a clear improvement in code ability. A joint team from Shanghai Jiao Tong University and Shanghai AI Lab proposed IDEAL, an innovative method that significantly improves an LLM's overall performance across multiple domains. The research also produced several notable findings, for example: [...] Specifically:

Some LLM abilities even degrade after SFT

Large language models (LLMs), with their strong understanding and logical reasoning, have demonstrated remarkable capabilities across many fields. Beyond growing parameter counts, high-quality data is widely recognized as the most critical factor in improving LLM performance.

When performing supervised fine-tuning (SFT), researchers found that LLMs often exhibit "lopsidedness" in multi-task settings: some abilities stand out while others fail to improve, or even degrade. This imbalance leaves the model with uneven capability across domains, which in turn hurts the user experience.

The SJTU and Shanghai AI Lab researchers quickly turned their attention to the SFT training set: can adjusting its composition mitigate the imbalance? Intuitively, simply doubling the training data for the model's weak subjects should change the final outcome. However, because training data from different domains are coupled, the researchers modeled and quantified each domain's data contribution to the final …
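The intuition above, giving more sampling weight to domains where the model is weak, can be sketched as a simple mixture reweighting. This is not the IDEAL algorithm itself (its actual optimization is cut off above); it is a hypothetical baseline illustrating the general idea, with `target` and `strength` as made-up knobs.

```python
# Hedged sketch (NOT the IDEAL method): upweight SFT subsets from domains
# where validation accuracy falls below a target, then renormalize so the
# weights form a sampling distribution over domains.
def reweight_mixture(base_counts, domain_scores, target=1.0, strength=1.0):
    """base_counts: {domain: num_examples in the SFT set};
    domain_scores: {domain: validation accuracy in [0, 1]}.
    Domains further below `target` get proportionally more weight."""
    raw = {
        d: n * (1.0 + strength * max(0.0, target - domain_scores[d]))
        for d, n in base_counts.items()
    }
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}

weights = reweight_mixture(
    {"math": 10000, "code": 10000},   # equal-size SFT subsets
    {"math": 0.80, "code": 0.40},     # model is weak at code
)
# The weak "code" domain ends up with the larger sampling weight.
```

Note that this naive scheme ignores exactly the coupling between domains that the article says motivated IDEAL's modeling approach: upweighting one domain can change performance on the others, which is why a principled quantification of each domain's contribution is needed.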