ICCV 2025 | A Cornerstone for General-Purpose Tool Agents: Peking University Proposes the ToolVQA Dataset
具身智能之心· 2025-08-22 16:03
Core Viewpoint
- The article introduces ToolVQA, a large-scale multimodal dataset designed to improve the tool-use capabilities of foundation models on multi-step reasoning visual question answering (VQA) tasks, targeting the significant performance gap these models show in real-world applications [2][3][7].

Summary by Sections

Dataset Overview
- ToolVQA contains 23,655 samples built on real image scenes and implicit multi-step reasoning tasks, closely aligned with actual user interaction needs [3][22].
- The dataset covers 10 types of multimodal tools and 7 task domains, with an average of 2.78 reasoning steps per sample [3][22].

Data Generation Process
- The dataset is generated by a novel data construction pipeline called ToolEngine, which uses depth-first search (DFS) and dynamic in-context example matching to simulate human-like tool-use reasoning chains [3][15][18].
- ToolEngine generates high-quality VQA instances fully automatically from a single image input, which significantly reduces data costs and allows the dataset to scale [15][18].

Key Features of ToolVQA
- The dataset features complex visual scenes with real-world context and challenging queries that require implicit multi-step reasoning [13][15].
- Each question requires the model to plan the order of tool calls on its own over multiple interactions, rather than being told explicitly which tools to use [15][20].
- ToolVQA covers a wide variety of tools, supporting tasks from text extraction to image understanding and numerical calculation [15][22].

Model Performance
- Fine-tuning on ToolVQA significantly improves model performance, with a fine-tuned 7B model outperforming the closed-source GPT-3.5-turbo on multiple evaluation metrics [3][24].
- The fine-tuned model also generalizes well to out-of-distribution datasets, surpassing GPT-3.5-turbo on various benchmarks [24][25].

Error Analysis
- Despite these improvements, an analysis of 100 failure cases reveals two key bottlenecks, parameter prediction and answer integration, and shows that early errors can lead to cumulative failures in multi-step reasoning tasks [27][28].
- The findings highlight the need for models that are more robust when handling dynamic feedback and integrating intermediate information [28].

Conclusion
- ToolVQA establishes a new benchmark for multi-step tool reasoning tasks, providing a structured framework for training and evaluating models' reasoning and tool understanding capabilities [31].
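
To make the DFS idea behind ToolEngine more concrete, the following is a minimal sketch of depth-first enumeration of tool chains. It is not the authors' implementation: the tool names, the `TOOL_GRAPH` edges, and the `dfs_tool_chains` helper are all hypothetical, and the sketch leaves out the image-conditioned choices and the dynamic in-context example matching that the summary attributes to ToolEngine. It only shows how a depth-first walk over "which tool may follow which" yields candidate reasoning chains of bounded length.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical tool graph: which tool may follow which. The tool names and
# edges are illustrative assumptions, not the ten tools used by ToolVQA.
TOOL_GRAPH: Dict[str, List[str]] = {
    "START": ["OCR", "ImageCaption"],
    "OCR": ["Calculator", "Search"],
    "ImageCaption": ["Search"],
    "Search": ["Calculator"],
    "Calculator": [],
}


def dfs_tool_chains(
    graph: Dict[str, List[str]],
    node: str = "START",
    path: Tuple[str, ...] = (),
    max_steps: int = 3,
) -> Iterator[Tuple[str, ...]]:
    """Yield every tool chain of 1..max_steps calls reachable from `node`."""
    if path:
        # Every non-empty prefix is itself a candidate chain.
        yield path
    if len(path) == max_steps:
        # Depth limit reached: do not extend this chain further.
        return
    for nxt in graph.get(node, []):
        # Go deeper along one branch before trying the next sibling.
        yield from dfs_tool_chains(graph, nxt, path + (nxt,), max_steps)


chains = list(dfs_tool_chains(TOOL_GRAPH))
for chain in chains:
    print(" -> ".join(chain))
```

In a real pipeline, each enumerated chain would still need a question and answer grounded in the input image; here the chains are only the skeleton of such a sample.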
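ICCV 2025 | A Cornerstone for General-Purpose Tool Agents: Peking University Proposes the ToolVQA Dataset, Leading a New Paradigm for Multimodal Multi-Step Reasoning VQA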
机器之心· 2025-08-22 04:01
Core Insights
- The article introduces ToolVQA, a large-scale multimodal dataset designed to improve the tool-use capabilities of foundation models on multi-step reasoning visual question answering (VQA) tasks [3][7][30].
- ToolVQA consists of 23,655 task samples, each requiring an average of 2.78 reasoning steps, and covers 10 types of tools across 7 application domains [21][30].
- The dataset was generated with an automated data synthesis engine called ToolEngine, which simulates human-like reasoning processes for tool usage [11][17][30].

Dataset Features
- Construction is fully automated: ToolEngine needs only an image as input to generate high-quality VQA instances, which significantly reduces data costs and allows the dataset to scale [11].
- The dataset uses real-world images and contexts, covering complex visual scenes such as news images and e-commerce scenarios, so the tasks align more closely with actual user behavior [11].
- The dataset emphasizes implicit multi-step reasoning, where models must plan the sequence of tool calls on their own without explicit prompts [11][19].

Tool Usage and Performance
- ToolVQA includes a diverse range of tools, supporting tasks from text extraction to image understanding and numerical calculation, which supports its practical applicability [21].
- Experimental results show that fine-tuning models on ToolVQA significantly improves their performance on complex reasoning tasks, surpassing the closed-source model GPT-3.5 on various evaluation metrics [23][30].
- Models fine-tuned on ToolVQA also generalize well, performing strongly on out-of-distribution datasets [24][30].

Error Analysis
- Despite the improvements, analysis of failure cases reveals key bottlenecks in parameter prediction and answer integration: models struggle to extract the essential information and to synthesize correct answers [26][30].
- The findings highlight the challenge of error accumulation in multi-step reasoning tasks and suggest that current models lack robustness in handling dynamic feedback and integrating intermediate information [27][30].

Conclusion
- ToolVQA serves not only as a dataset but also as a set of evaluation standards and task frameworks for multimodal tool agents, providing a solid foundation for future advances in the reasoning and generalization capabilities of AI models [30].
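
A rough way to see why error accumulation matters in multi-step tool reasoning is the sketch below. It rests on a simplifying assumption that is not taken from the article: every step succeeds independently with the same probability, so a chain is correct only if all of its steps are. The per-step accuracies are made-up illustrative values, not measured results, and `chain_success_probability` is a hypothetical helper.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    # Independent steps multiply: p * p * ... * p (num_steps times).
    return per_step_accuracy ** num_steps


# Illustrative per-step accuracies only; these are not results from ToolVQA.
for p in (0.95, 0.90, 0.80):
    for steps in (1, 2, 3):
        prob = chain_success_probability(p, steps)
        print(f"per-step accuracy {p:.2f}, {steps} step(s): {prob:.3f}")
```

Under this assumption, whole-chain accuracy drops faster than per-step accuracy as chains get longer, which is consistent with the observation that a single early mistake, such as a wrongly predicted tool parameter, can sink the final answer.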