ICCV 2025 | One image is all you need: multimodal instruction data synthesis where you just supply the images and Oasis handles the rest
机器之心· 2025-07-18 03:14
Core Viewpoint - The article presents Oasis, a multimodal instruction data synthesis method that takes only images as input. Because no complex prompt design is required, it aims to improve both the efficiency and the quality of data synthesis [1][6].

Research Motivation - Traditional multimodal data synthesis methods suffer from limited diversity, insufficient quality, and heavy reliance on manual input. Oasis is designed to address these problems [7][8].

Method Introduction - Oasis operates in three main steps: (1) constructing a hooking prompt for autoregressive sampling, (2) classifying the sampled outputs and retaining only instruction-type ones, and (3) applying quality control and generating responses [11][12].

Data Characteristics Analysis - The Oasis dataset, Oasis-500k, was synthesized from approximately 500,000 images. The data volume grows linearly with the number of images, which demonstrates scalability [21][22].
- The average instruction length in Oasis data is 76.80 and the average response length is 71.16, indicating richer information content than LLaVA-NeXT [24].
- The languages covered include English (78.52%), Chinese (18.66%), and several others, showing broad applicability [27].

Experimental Results - Oasis shows significant improvements over the baselines, with average accuracy increases of 3.1% for Vicuna1.5, 1.8% for Qwen2.5, and 3.2% for Llama3 [38].
- Adding 500k Oasis samples raised the average score by 5.2%, confirming the effectiveness of data scaling [41].

Effectiveness of Oasis - Oasis is also effective at synthesizing domain-specific data, particularly for OCR tasks, where it leads to notable gains on the relevant benchmarks [43].

Quality Control Mechanism - Quality control of the instructions is essential: it significantly improves model performance, with an increase of over 7% on specific tasks [50].
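The three steps described under Method Introduction can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`synthesize`, `demo_sample`, `demo_is_instruction`, `demo_quality`, `demo_respond`) and the `Sample` record are hypothetical, and the toy stand-ins replace what would in practice be calls to a multimodal model. The summary does not specify how classification or quality control is actually performed, so simple placeholder rules are used here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    image_id: str
    instruction: str
    response: str


def synthesize(
    images: Iterable[str],
    sample_fn: Callable[[str], str],
    is_instruction_fn: Callable[[str], bool],
    passes_quality_fn: Callable[[str, str], bool],
    respond_fn: Callable[[str, str], str],
) -> List[Sample]:
    """Hypothetical sketch of an image-only synthesis loop (at most one sample per image)."""
    samples = []
    for image_id in images:
        # Step 1: the "hooking prompt" contains only the image; the model
        # continues it autoregressively and produces free-form text.
        text = sample_fn(image_id)
        # Step 2: classify the sampled text and keep instruction-type outputs only.
        if not is_instruction_fn(text):
            continue
        # Step 3a: quality control on the retained instruction.
        if not passes_quality_fn(image_id, text):
            continue
        # Step 3b: generate a response for the surviving instruction.
        samples.append(Sample(image_id, text, respond_fn(image_id, text)))
    return samples


# Toy stand-ins (assumptions for illustration only; a real pipeline would
# query a multimodal model in each of these places).
_CANNED = {
    "img_a": "What is written on the sign in this image?",
    "img_b": "A dog sitting on a sofa.",  # a caption, not an instruction
    "img_c": "Hi?",                       # instruction-like but too short
}


def demo_sample(image_id: str) -> str:
    return _CANNED[image_id]


def demo_is_instruction(text: str) -> bool:
    return text.rstrip().endswith("?")


def demo_quality(image_id: str, text: str) -> bool:
    return len(text.split()) >= 4


def demo_respond(image_id: str, text: str) -> str:
    return f"[answer for {image_id}]"


samples = synthesize(
    ["img_a", "img_b", "img_c"],
    demo_sample, demo_is_instruction, demo_quality, demo_respond,
)
```

In this toy run only `img_a` survives both filters. Because each image is processed independently and yields at most one sample, the output size grows at most linearly with the number of input images, which matches the scaling behavior noted under Data Characteristics Analysis.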