Multimodal Visual Question Answering
Using AI to "Diagnose" Coral Reefs
Core Insights
- The Ministry of Natural Resources' South China Sea Development Research Institute and Beijing University of Posts and Telecommunications have developed the first multimodal visual question-answering dataset focused on coral image understanding [1]
- Coral reefs, often called "the tropical rainforests of the ocean," are difficult to monitor and identify worldwide, and this work still relies heavily on manual interpretation [1]
- Existing coral datasets are small and poorly labeled, which makes it difficult for traditional visual question-answering (VQA) methods to assess coral health and symbiotic relationships [1]

Dataset Overview
- The dataset comprises 12,800 coral images covering 67 genera across 20 families, from which 270,000 question-answer pairs were generated along 16 dimensions such as coral type, location, and quantity [1]
- It aims to convert ecological knowledge and professional analysis into intuitive, structured information, so that users can obtain scientific answers by supplying a coral image and a question [1]
- Compared with general question-answer datasets, this dataset improves average accuracy on visual question-answering tasks and ecological health assessment tasks by 44% and 36%, respectively [1]

Future Developments
- The research institute plans to improve the AI model's understanding of coral classification, health status, and ecological relationships by optimizing the coral knowledge graph and using multi-source coral data for continued pre-training [1]

ICCV 2025 | Building a Cornerstone for General-Purpose Tool Agents: Peking University Proposes the ToolVQA Dataset
具身智能之心· 2025-08-22 16:03
Core Viewpoint
- The article introduces ToolVQA, a large-scale multimodal dataset designed to improve the tool-use capabilities of foundation models on multi-step reasoning visual question answering (VQA) tasks, addressing a significant performance gap in real-world applications [2][3][7].

Summary by Sections

Dataset Overview
- ToolVQA contains 23,655 samples that combine real image scenes with implicit multi-step reasoning tasks, closely matching actual user interaction needs [3][22].
- The dataset covers 10 types of multimodal tools and 7 task domains, with an average of 2.78 reasoning steps per sample [3][22].

Data Generation Process
- The dataset is generated with a new data construction pipeline called ToolEngine, which uses depth-first search (DFS) and dynamic in-context example matching to simulate human-like tool-use reasoning chains [3][15][18].
- ToolEngine generates high-quality VQA instances from a single input image in a fully automated way, which significantly reduces data costs and makes the process scalable [15][18].

Key Features of ToolVQA
- The dataset features complex visual scenes with real-world context and challenging queries that require implicit multi-step reasoning [13][15].
- Each question requires the model to plan the order of tool calls autonomously across multiple interactions, rather than following explicit prompts [15][20].
- ToolVQA covers a wide variety of tools, supporting tasks that range from text extraction to image understanding and numerical calculation [15][22].

Model Performance
- Fine-tuning on ToolVQA significantly improves model performance, with a 7B model outperforming the closed-source GPT-3.5-turbo on multiple evaluation metrics [3][24].
- The fine-tuned model also generalizes well to out-of-distribution datasets, surpassing GPT-3.5-turbo on various benchmarks [24][25].

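The DFS-based construction described under Data Generation Process can be pictured with a small sketch. This is not ToolEngine itself: the tool names, the `TOOL_GRAPH` compatibility map, and the `enumerate_tool_chains` helper are all hypothetical, and the in-context example matching step is not modeled. The sketch only shows how a depth-first search can enumerate candidate tool-call chains up to a step limit.

```python
from typing import Dict, List

# Hypothetical tool graph (illustrative names only, not ToolVQA's tool set):
# each key is a tool, each value lists tools that could plausibly consume
# that tool's output as their next input.
TOOL_GRAPH: Dict[str, List[str]] = {
    "image_caption": ["web_search", "calculator"],
    "ocr": ["calculator", "web_search"],
    "web_search": ["calculator"],
    "calculator": [],
}


def enumerate_tool_chains(
    graph: Dict[str, List[str]], start: str, max_steps: int
) -> List[List[str]]:
    """Depth-first enumeration of tool-call chains beginning at `start`.

    Every prefix of a path is recorded as a candidate chain, and a tool is
    never repeated within one chain.
    """
    chains: List[List[str]] = []

    def dfs(path: List[str]) -> None:
        chains.append(list(path))      # record the current chain
        if len(path) == max_steps:     # stop at the step limit
            return
        for next_tool in graph[path[-1]]:
            if next_tool not in path:  # avoid revisiting a tool
                path.append(next_tool)
                dfs(path)
                path.pop()             # backtrack

    dfs([start])
    return chains


chains = enumerate_tool_chains(TOOL_GRAPH, "ocr", max_steps=3)
for chain in chains:
    print(" -> ".join(chain))
```

In a real pipeline, each candidate chain would still need to be executed on the image and turned into a question-answer pair. The sketch stops at enumeration because the source does not describe those steps in detail.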
Error Analysis
- Despite these improvements, an analysis of 100 failure cases reveals key bottlenecks in parameter prediction and answer integration, indicating that early errors can compound into failures later in a multi-step reasoning task [27][28].
- The findings highlight the need for models that are more robust when handling dynamic feedback and integrating intermediate information [28].

Conclusion
- ToolVQA establishes a new benchmark for multi-step tool reasoning tasks, providing a structured framework for training and evaluating models' reasoning and tool understanding capabilities [31].
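The compounding effect noted in the Error Analysis can be illustrated with simple arithmetic. The sketch below assumes every step succeeds independently with the same probability. That simplification is not taken from the source, and the 0.9 per-step accuracy and the `chain_success_probability` helper are hypothetical.

```python
def chain_success_probability(step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


# Hypothetical per-step accuracy of 0.9, for chains of one to four steps.
for steps in (1, 2, 3, 4):
    probability = chain_success_probability(0.9, steps)
    print(f"{steps} step(s): {probability:.4f}")
```

Under these assumptions, a three-step chain succeeds only about 73% of the time even when each step is 90% reliable, so a single early mistake in parameter prediction can be enough to sink the final answer.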