Multi-step Reasoning
ICCV 2025 | A Cornerstone for General-Purpose Tool Agents: Peking University Proposes the ToolVQA Dataset
具身智能之心· 2025-08-22 16:03
Core Viewpoint
- The article introduces ToolVQA, a large-scale multimodal dataset designed to strengthen foundation models' tool-use capabilities on multi-step reasoning visual question answering (VQA) tasks, addressing a significant performance gap in real-world applications [2][3][7].

Summary by Sections

Dataset Overview
- ToolVQA contains 23,655 samples featuring real image scenes and implicit multi-step reasoning tasks, closely aligned with actual user interaction needs [3][22].
- The dataset covers 10 types of multimodal tools and 7 task domains, with an average of 2.78 reasoning steps per sample [3][22].

Data Generation Process
- The dataset is generated by a novel construction pipeline called ToolEngine, which uses depth-first search (DFS) and dynamic in-context example matching to simulate human-like tool-use reasoning chains [3][15][18].
- ToolEngine fully automates the generation of high-quality VQA instances from a single image input, significantly reducing data costs and enabling scalability [15][18].

Key Features of ToolVQA
- The dataset features complex visual scenes with real-world context and challenging queries that require implicit multi-step reasoning [13][15].
- Each question requires the model to plan the order of tool calls autonomously across multiple interactions, rather than being explicitly prompted [15][20].
- ToolVQA spans a rich variety of tools, supporting tasks from text extraction to image understanding and numerical calculation [15][22].

Model Performance
- Fine-tuning on ToolVQA significantly improves model performance, with the 7B model outperforming the closed-source GPT-3.5-turbo on multiple evaluation metrics [3][24].
- The fine-tuned model also generalizes strongly to out-of-distribution datasets, surpassing GPT-3.5-turbo on various benchmarks [24][25].
Error Analysis
- Despite the improvements, an analysis of 100 failure cases reveals key bottlenecks in parameter prediction and answer integration, indicating that early errors can cascade into cumulative failures in multi-step reasoning tasks [27][28].
- The findings highlight the need for greater robustness when models handle dynamic feedback and integrate intermediate information [28].

Conclusion
- ToolVQA establishes a new benchmark for multi-step tool reasoning, providing a structured framework for training and evaluating models' reasoning and tool-understanding capabilities [31].
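The DFS-based generation idea behind ToolEngine can be illustrated with a minimal sketch: depth-first search over a type-compatible tool graph to enumerate candidate tool chains. The toolbox below and the type annotations are illustrative assumptions, not the paper's actual tool set or implementation, and the sketch omits ToolEngine's dynamic in-context example matching.

```python
# Hypothetical toolbox: each tool maps an input type to an output type.
# The real ToolVQA toolbox (OCR, captioning, calculator, ...) is richer.
TOOLS: dict[str, tuple[str, str]] = {
    "OCR":        ("image", "text"),
    "Caption":    ("image", "text"),
    "Translator": ("text", "text"),
    "Calculator": ("text", "number"),
}

def dfs_chains(current_type: str, target_type: str,
               chain: list[str], max_steps: int,
               out: list[list[str]]) -> None:
    """Depth-first search over tool sequences: extend the chain with any
    tool whose input type matches the current intermediate type, and
    record every chain that ends in the target type."""
    if chain and current_type == target_type:
        out.append(chain.copy())
    if len(chain) >= max_steps:
        return
    for name, (t_in, t_out) in TOOLS.items():
        if t_in == current_type and name not in chain:  # avoid trivial loops
            chain.append(name)
            dfs_chains(t_out, target_type, chain, max_steps, out)
            chain.pop()

# Enumerate all chains that turn an image into a number in <= 3 steps,
# e.g. OCR -> Calculator for "sum the prices shown on this receipt".
chains: list[list[str]] = []
dfs_chains("image", "number", [], 3, chains)
```

Each enumerated chain would then be instantiated into a concrete multi-step VQA instance for a given input image; ToolEngine's example-matching step steers the search toward human-like chains rather than exhaustively emitting all of them.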
OpenAI Takes IOI Gold, Behind Only the Top Five Human Contestants! Its Entry Was the Reasoning Model That Just Won IMO Gold
创业邦· 2025-08-12 03:33
Core Viewpoint
- OpenAI's reasoning model achieved a gold-medal score at the 2025 International Olympiad in Informatics (IOI), ranking first among AI participants and demonstrating significant advances in general reasoning capabilities [2][9][16].

Group 1: Competition Performance
- OpenAI competed in the online AI track of IOI 2025, scoring just behind five human competitors among 330 participants and securing the top position among AI entrants [6][8].
- The model OpenAI used was not specifically trained for the IOI; it was a general reasoning model that nonetheless performed exceptionally well [8][14].
- Compared with last year, OpenAI's score jumped from the 49th percentile to the 98th percentile, a dramatic leap in capability [9].

Group 2: Model and Strategy
- OpenAI used the same model that won gold at the 2025 International Mathematical Olympiad (IMO), with no modifications for the IOI [14][15].
- The strategy involved sampling answers from different models and using a heuristic method to select which solutions to submit, which contributed to the successful outcome [14].

Group 3: Community Reaction and Future Implications
- The achievement has generated excitement in the community, highlighting the growing strength of general reasoning abilities attained without specialized training [16].
- There is anticipation that OpenAI will release a public version of the technology behind the gold-medal performance, pointing to further advances in AI capabilities [18].
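The sample-then-select strategy described above can be sketched as best-of-n selection under a submission budget. Everything here is a toy stand-in: the candidates, the divisibility-based "tests", and the scoring heuristic are assumptions for illustration, not OpenAI's actual selection signals.

```python
import random

random.seed(0)  # deterministic toy run

def select_submissions(candidates, score_fn, budget):
    """Rank sampled candidate solutions by a heuristic score and keep
    only as many as the submission budget allows."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:budget]

# Toy setup: candidate "solutions" are integers; the heuristic counts how
# many sample test cases (here: divisibility checks) each one satisfies.
sample_tests = [2, 3, 5]
candidates = [random.randrange(1, 100) for _ in range(20)]
score = lambda c: sum(c % t == 0 for t in sample_tests)

picked = select_submissions(candidates, score, budget=3)
```

The design point is that sampling many answers is cheap relative to submissions, so even a crude heuristic that ranks candidates by agreement with known checks can recover most of the value of the best sample.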
Alibaba's Agent Beats GPT-4o at Multi-Turn Reasoning; Open-Source Models Can Also Do Deep Research
量子位· 2025-06-06 04:01
Group 1
- The core viewpoint of the article is the introduction of WebDancer, an advanced autonomous information-seeking agent developed by Tongyi Lab, which addresses the growing demand for multi-step information retrieval in an era of information overload [1][2][3].

Group 2
- Background: Traditional search engines cannot satisfy users' needs for deep, multi-step information retrieval across fields such as medical research, technological innovation, and business decision-making [3].
- Challenges: Building autonomous agents is difficult, particularly in obtaining the high-quality training data needed for complex multi-step reasoning [4].

Group 3
- Innovative Data Synthesis: WebDancer proposes two innovative data synthesis approaches, the ReAct framework and E2HQA, to address data scarcity [5][6].
- ReAct Framework: This framework runs a Thought-Action-Observation cycle, enabling the agent to generate a thought, take a structured action, and receive feedback iteratively [5].

Group 4
- Training Strategies: WebDancer employs a two-phase training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), to improve the agent's adaptability and decision-making in dynamic environments [12][13].
- Data Quality Assurance: A multi-stage data filtering strategy ensures high-quality training data, improving the agent's learning efficiency [9][10].

Group 5
- Experimental Results: WebDancer delivers outstanding performance on several information-seeking benchmarks, excelling in particular on the GAIA and WebWalkerQA datasets [17][18][19].
- Performance Metrics: The best-performing models achieved a Pass@3 score of 61.1% on the GAIA benchmark and 54.6% on the WebWalkerQA benchmark, showcasing their robust capabilities [20].
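The Thought-Action-Observation cycle can be sketched as a minimal ReAct-style loop. The stubbed policy, the single `search` tool, and the hard-coded answer are illustrative assumptions standing in for WebDancer's actual LLM and tool set.

```python
def llm_step(history: str) -> tuple[str, str, str]:
    """Stand-in for the policy model: returns (thought, action, argument).
    A real agent would prompt an LLM with the full trajectory so far."""
    if "Observation:" not in history:
        return ("I should search first.", "search", "WebDancer benchmark results")
    return ("I have enough information.", "answer", "Pass@3 of 61.1% on GAIA")

# One illustrative tool; WebDancer combines search, browsing, etc.
TOOLS = {"search": lambda q: f"Top result for '{q}'"}

def react_loop(question: str, max_turns: int = 5) -> str:
    """ReAct cycle: append Thought and Action to the trajectory, execute
    the tool, append its Observation, and repeat until a terminal answer."""
    history = f"Question: {question}"
    for _ in range(max_turns):
        thought, action, arg = llm_step(history)
        history += f"\nThought: {thought}\nAction: {action}[{arg}]"
        if action == "answer":            # terminal action ends the loop
            return arg
        observation = TOOLS[action](arg)  # execute the tool
        history += f"\nObservation: {observation}"
    return "no answer"
```

Because the whole trajectory is serialized into `history`, the same loop doubles as a data-synthesis recorder: each completed run is a Thought-Action-Observation trace of the kind used as SFT training data.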
Group 6
- Future Prospects: WebDancer aims to integrate more complex tools and extend its capabilities to open-domain long-form writing tasks, enhancing the agent's reasoning and generative abilities [29][30].
- Emphasis on Agentic Models: The focus is on developing foundation models that natively support reasoning, decision-making, and multi-step tool invocation, reflecting an engineering philosophy of simplicity and universality [30][31].