Tool-Augmented Visual Question Answering (VQA)
ICCV 2025 | A Cornerstone for General Tool Agents: Peking University Proposes the ToolVQA Dataset, a New Paradigm for Multimodal Multi-Step Reasoning VQA
机器之心 · 2025-08-22 04:01
Core Insights
- The article introduces ToolVQA, a large-scale multimodal dataset designed to enhance the tool-use capabilities of foundation models on multi-step reasoning visual question answering (VQA) tasks [3][7][30]
- ToolVQA consists of 23,655 task samples, each requiring an average of 2.78 reasoning steps, and covers 10 types of tools across 7 application domains [21][30]
- The dataset was generated by an automated data-synthesis engine called ToolEngine, which simulates human-like tool-use reasoning processes [11][17][30]

Dataset Features
- ToolVQA generation is fully automated: only an image input is needed to produce high-quality VQA instances, significantly reducing data costs and enabling scalability [11]
- It includes real-world images and contexts, covering complex visual scenes such as news images and e-commerce scenarios, making the tasks better aligned with actual user behavior [11]
- The dataset emphasizes implicit multi-step reasoning: models must autonomously plan the sequence of tool calls without explicit prompts [11][19]

Tool Usage and Performance
- ToolVQA includes a diverse range of tools, supporting tasks from text extraction to image understanding and numerical calculation, ensuring practical applicability [21]
- Experimental results show that fine-tuning models on ToolVQA significantly improves their performance on complex reasoning tasks, surpassing the closed-source model GPT-3.5 on multiple evaluation metrics [23][30]
- The dataset also demonstrates strong generalization: fine-tuned models perform well on out-of-distribution datasets [24][30]

Error Analysis
- Despite the improvements, analysis of failure cases reveals key bottlenecks in parameter prediction and answer integration, indicating that models struggle to extract essential information and synthesize correct answers [26][30]
- The findings highlight the challenge of error accumulation in multi-step reasoning tasks, suggesting that current models lack robustness in dynamic feedback handling and intermediate-information integration [27][30]

Conclusion
- ToolVQA serves not only as a dataset but also establishes evaluation standards and a task framework for multimodal tool agents, providing a solid foundation for future advances in the reasoning and generalization capabilities of AI models [30]
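To make the "implicit multi-step reasoning" idea concrete, the sketch below shows what a tool-use VQA instance with a multi-step trajectory might look like, and how the reported average step count (2.78 in ToolVQA) would be computed. This is a minimal illustration: the class and field names (`ToolCall`, `VQAInstance`, `trajectory`, etc.) are hypothetical and do not reflect ToolVQA's actual schema or ToolEngine's implementation.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of a multi-step tool-use VQA instance.
# Field names are illustrative, not ToolVQA's real schema.

@dataclass
class ToolCall:
    tool: str          # e.g. "OCR", "Calculator"
    arguments: dict    # parameters the model must predict
    observation: str   # tool output fed back into the reasoning chain

@dataclass
class VQAInstance:
    image: str         # path or URL of the input image
    question: str
    answer: str
    trajectory: List[ToolCall] = field(default_factory=list)

    @property
    def num_steps(self) -> int:
        # Number of tool calls the model must chain to reach the answer
        return len(self.trajectory)

def average_steps(instances: List[VQAInstance]) -> float:
    """Mean number of tool calls per instance (ToolVQA reports ~2.78)."""
    return sum(inst.num_steps for inst in instances) / len(instances)

# Toy example: an implicit two-step trajectory (OCR, then Calculator),
# with no explicit prompt telling the model which tools to use or in
# what order.
example = VQAInstance(
    image="receipt.jpg",
    question="What is the total price of the two listed items?",
    answer="12.50",
    trajectory=[
        ToolCall("OCR", {"region": "full"}, "Item A: 5.00\nItem B: 7.50"),
        ToolCall("Calculator", {"expression": "5.00 + 7.50"}, "12.50"),
    ],
)
print(example.num_steps)         # 2
print(average_steps([example]))  # 2.0
```

The trajectory structure also makes the error analysis above tangible: a wrong `arguments` dict at any step (parameter prediction) or a wrong final synthesis from the observations (answer integration) derails the whole chain, which is how errors accumulate in multi-step reasoning.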