ICCV 2025 | Building the Cornerstone of General-Purpose Tool Agents: Peking University Proposes the ToolVQA Dataset
具身智能之心· 2025-08-22 16:03
Core Viewpoint
- The article introduces ToolVQA, a large-scale multimodal dataset designed to strengthen the tool-use capabilities of foundation models on multi-step reasoning visual question answering (VQA) tasks, addressing a significant performance gap in real-world applications [2][3][7].

Summary by Sections

Dataset Overview
- ToolVQA contains 23,655 samples, featuring real image scenes and implicit multi-step reasoning tasks closely aligned with actual user interaction needs [3][22].
- The dataset covers 10 types of multimodal tools and 7 task domains, with an average of 2.78 reasoning steps per sample [3][22].

Data Generation Process
- The dataset is generated with a novel construction pipeline called ToolEngine, which uses depth-first search (DFS) combined with dynamic in-context example matching to simulate human-like tool-use reasoning chains [3][15][18].
- ToolEngine fully automates the generation of high-quality VQA instances from a single image input, significantly reducing data costs and enabling scalability [15][18].

Key Features of ToolVQA
- The dataset features complex visual scenes with real-world context and challenging queries that require implicit multi-step reasoning [13][15].
- Each question requires the model to autonomously plan the order of tool calls across multiple interactions, rather than being explicitly prompted [15][20].
- ToolVQA covers a rich variety of tools, supporting tasks from text extraction to image understanding and numerical calculation [15][22].

Model Performance
- Fine-tuning on ToolVQA significantly improves model performance: the 7B model outperforms the closed-source GPT-3.5-turbo on multiple evaluation metrics [3][24].
- The fine-tuned model also generalizes well to out-of-distribution datasets, surpassing GPT-3.5-turbo on various benchmarks [24][25].
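The DFS-based chain construction described above can be sketched as follows. This is an illustrative sketch only, not the paper's actual ToolEngine implementation: the `Tool` class, the type-matching rule, and the tool names are all hypothetical stand-ins for the real multimodal tools.

```python
import random
from dataclasses import dataclass

@dataclass
class Tool:
    """A toy tool: consumes a value of `in_type`, produces one of `out_type`."""
    name: str
    in_type: str
    out_type: str

    def accepts(self, state):
        return state[0] == self.in_type

    def run(self, state):
        # Hypothetical stand-in for a real tool call (OCR, calculator, ...)
        return (self.out_type, f"{self.name}({state[1]})")

def generate_chain(tools, start, min_depth=2, max_depth=4):
    """Depth-first search over tool sequences: at each step, try tools whose
    input type matches the current intermediate result; backtrack when no
    compatible tool remains before reaching the minimum chain length."""
    def dfs(state, chain):
        if len(chain) == max_depth:
            return chain
        candidates = [t for t in tools if t.accepts(state)]
        random.shuffle(candidates)  # diversify generated chains
        for tool in candidates:
            result = tool.run(state)
            done = dfs(result, chain + [(tool.name, result[1])])
            if done is not None:
                return done
        return chain if len(chain) >= min_depth else None
    return dfs(start, [])
```

With tools such as `Tool("OCR", "image", "text")` and `Tool("Calculator", "text", "number")`, calling `generate_chain(tools, ("image", "img_001"))` yields a typed tool chain rooted at the image, mirroring how a single image input can seed a multi-step reasoning instance.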
Error Analysis
- Despite the improvements, an analysis of 100 failure cases reveals key bottlenecks in parameter prediction and answer integration, indicating that early errors can lead to cumulative failures in multi-step reasoning tasks [27][28].
- The findings highlight the need for greater robustness when models handle dynamic feedback and integrate intermediate information [28].

Conclusion
- ToolVQA establishes a new benchmark for multi-step tool reasoning tasks, providing a structured framework for training and evaluating models' reasoning and tool-understanding capabilities [31].
ICCV 2025 | Building the Cornerstone of General-Purpose Tool Agents: Peking University Proposes the ToolVQA Dataset, Leading a New Paradigm for Multimodal Multi-step Reasoning VQA
机器之心· 2025-08-22 04:01
Breaking the Synthetic Paradigm: ToolVQA Ushers in a New Era of Multi-step Tool-Based Question Answering on Real Images

This paper proposes a new multimodal visual question answering dataset, ToolVQA, which provides large models with a systematic, multi-step reasoning benchmark for training and evaluation through real-world tasks and complex tool-chain simulation. Integrating external tools into Large Foundation Models (LFMs) has become an important direction for improving their ability to handle complex tasks. With external tools, a model can decompose a hard problem into smaller subtasks and delegate them to tools with specific functions, achieving stronger generalization and execution capability.

The first author of this paper is Shaofeng Yin (殷绍峰), an undergraduate at Peking University; collaborators include Ting Lei (雷廷), a PhD student at Peking University; the corresponding author is Yang Liu (刘洋), a researcher and assistant professor at the Wangxuan Institute of Computer Technology, Peking University.

This article introduces the team's latest paper: ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools.

The paper proposes ToolVQA, a large-scale multimodal dataset aimed at improving the tool-use capability of foundation models. Existing work has already shown strong performance on tool-augmented visual question answering (VQA) tasks, but real-world multimodal tasks often involve multi-step reasoning and functionally diverse tool use, where current models still fall significantly short. To fill this gap, To ...
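The decomposition described above, where the model delegates subtasks to tools over multiple interactions, is typically realized as an observe-act agent loop. A minimal sketch follows; the action dictionary format, the tool dispatcher, and the step budget are illustrative assumptions, not ToolVQA's actual interaction protocol.

```python
def run_agent(model, tools, question, max_steps=5):
    """Generic multi-step tool loop: at each step the model reads the full
    interaction history and either requests a tool call (name + arguments)
    or emits a final answer."""
    history = [("question", question)]
    for _ in range(max_steps):
        action = model(history)  # the model plans the next step itself
        if action["type"] == "answer":
            return action["content"]
        tool_fn = tools[action["tool"]]
        observation = tool_fn(action["args"])
        history.append((action["tool"], observation))  # feed result back
    return None  # exceeded the step budget without answering
```

In the ToolVQA setting, `model` would be a fine-tuned LFM and `tools` the dataset's multimodal tools; the key property the benchmark tests is that the tool order is never given in the prompt, so the model must plan it from intermediate feedback alone.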