Multi-Step Reasoning

ICCV 2025 | A Cornerstone for General-Purpose Tool Agents: Peking University Proposes the ToolVQA Dataset
具身智能之心 · 2025-08-22 16:03
Core Viewpoint

- The article introduces ToolVQA, a large-scale multimodal dataset designed to enhance the tool-usage capabilities of foundation models on multi-step-reasoning visual question answering (VQA) tasks, addressing the significant performance gap in real-world applications [2][3][7].

Summary by Sections

Dataset Overview

- ToolVQA contains 23,655 samples, featuring real image scenes and implicit multi-step reasoning tasks, closely aligned with actual user interaction needs [3][22].
- The dataset covers 10 types of multimodal tools and 7 task domains, with an average of 2.78 reasoning steps per sample [3][22].

Data Generation Process

- The dataset is generated with a novel data construction pipeline called ToolEngine, which employs depth-first search (DFS) and dynamic in-context example matching to simulate human-like tool-usage reasoning chains [3][15][18] (see the sketch after this summary).
- ToolEngine fully automates the generation of high-quality VQA instances from a single image input, significantly reducing data costs and enabling scalability [15][18].

Key Features of ToolVQA

- The dataset features complex visual scenes with real-world context and challenging queries requiring implicit multi-step reasoning [13][15].
- Each question requires the model to autonomously plan the order of tool calls across multiple interactions, rather than being explicitly prompted [15][20].
- ToolVQA encompasses a rich variety of tools, supporting tasks from text extraction to image understanding and numerical calculation [15][22].

Model Performance

- Fine-tuning on ToolVQA significantly improves model performance, with the 7B model outperforming the closed-source GPT-3.5-turbo on multiple evaluation metrics [3][24].
- The fine-tuned model also generalizes well to out-of-distribution datasets, surpassing GPT-3.5-turbo on various benchmarks [24][25].

Error Analysis

- Despite these improvements, an analysis of 100 failure cases reveals key bottlenecks in parameter prediction and answer integration, indicating that early errors can cascade into cumulative failures in multi-step reasoning tasks [27][28].
- The findings highlight the need for greater robustness when models handle dynamic feedback and integrate intermediate information [28].

Conclusion

- ToolVQA establishes a new benchmark for multi-step tool-reasoning tasks, providing a structured framework for training and evaluating models' reasoning and tool-understanding capabilities [31].
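The DFS-based generation step is easy to picture in miniature. Below is a minimal sketch of how a ToolEngine-style pipeline might enumerate modality-compatible tool chains starting from a single image input; the tool registry, the compatibility rule, and all names here are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of DFS over modality-compatible tool chains, in the spirit
# of ToolEngine. Tool names and the compatibility rule are illustrative
# assumptions, not the paper's actual implementation.
import random

# Hypothetical tool registry: each tool maps an input modality to an output modality.
TOOLS = {
    "OCR":          ("image", "text"),    # extract text from the image
    "ImageCaption": ("image", "text"),    # describe the image in text
    "TextQA":       ("text",  "text"),    # answer a question over text
    "Calculator":   ("text",  "number"),  # evaluate an arithmetic expression
}

def dfs_chains(state, chain, max_depth, out):
    """Depth-first enumeration of tool chains whose input/output modalities line up."""
    if chain:
        out.append(list(chain))  # every non-empty prefix is a candidate reasoning chain
    if len(chain) == max_depth:
        return
    for name, (inp, outp) in TOOLS.items():
        if inp == state and name not in chain:  # modality must match; avoid repeats
            chain.append(name)
            dfs_chains(outp, chain, max_depth, out)
            chain.pop()

candidates = []
dfs_chains("image", [], max_depth=3, out=candidates)  # a single image is the only input
print(random.choice(candidates))  # e.g. ['ImageCaption', 'TextQA', 'Calculator']
```

In the full pipeline as the summary describes it, a sampled chain would then be grounded by actually executing the tools on the image, with dynamically matched in-context examples guiding an LLM to phrase the resulting question-answer pair.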
OpenAI Takes IOI Gold, Behind Only the Top Five Human Contestants! Its Competing Reasoning Model Had Just Won IMO Gold
创业邦 · 2025-08-12 03:33
Source: 机器之心 (ID: almosthuman2014) | Editor: 杜伟 | Image source: OpenAI

Overnight, OpenAI's large model has pulled off another feat!

At one of the world's top programming competitions, the 2025 International Olympiad in Informatics (IOI), OpenAI's reasoning model scored high enough to earn a gold medal and ranked first among the AI entrants.

IOI 2025 (the 37th International Olympiad in Informatics) was held in Sucre, Bolivia; it opened on July 27 and closed on August 3. The Chinese team swept the competition, with every member taking gold.

Not long before, OpenAI had earned a gold-medal-level score at IMO 2025 (the International Mathematical Olympiad).

As at the IMO, OpenAI used no internet access or RAG (retrieval-augmented generation); the model could access only a basic terminal tool.

[Truncated leaderboard table: Rank, First Name, Last Name, ID, Team, per-problem score columns, and Global total]
Alibaba's Agent Surpasses GPT-4o in Multi-Turn Reasoning; Open-Source Models Can Do Deep Research Too
量子位 · 2025-06-06 04:01
Contributed by the WebDancer team | 量子位 (QbitAI)

An agent that can complete multi-step information-seeking tasks, spanning multi-turn reasoning and consecutive action execution, has arrived (a minimal sketch of such a loop appears at the end of this excerpt).

Tongyi Lab has released WebDancer, an autonomous information-seeking agent and the follow-up to WebWalker (ACL 2025).

Through a systematic training paradigm covering the full pipeline from data construction to algorithm design, WebDancer lays out a clear path to building agents with long-horizon information-seeking ability. The framework also offers practical guidance for reproducing Deep Research systems on open-source models. The team plans to keep extending and integrating agentic capabilities in more open environments and with more tools, advancing the deployment and evolution of general-purpose agents.

1. Background: New Demands and Challenges in Information Seeking

In the era of information explosion, traditional search engines can no longer satisfy users' needs for deep, multi-step information acquisition. From medical research to technological innovation, from business decisions to academic exploration, solving complex problems requires deep information mining and multi-step reasoning. This has created demand for agents that can think and decide for themselves.

Building such agents, however, faces numerous challenges:

2. Overcoming the Scarcity of Training Data

In autonomous information seeking, high-quality training data is essential. Yet existing datasets such as 2WIKI and HotpotQA consist mostly of shallow questions and cannot support training for complex multi-step reasoning.

Data filtering ...
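To make "multi-turn reasoning plus consecutive action execution" concrete, here is a minimal sketch of the think-act-observe loop that Deep Research-style agents run. The tool stubs, the `call_llm` placeholder, and the `ACTION`/`FINAL` convention are assumptions for illustration, not WebDancer's actual interface.

```python
# Minimal sketch of a multi-step information-seeking agent loop.
# Tools, the policy stub, and the step format are illustrative assumptions.
from typing import Callable

def search(query: str) -> str:
    """Stand-in for a web-search tool; a real agent would call a search API."""
    return f"[top snippets for: {query}]"

def visit(url: str) -> str:
    """Stand-in for a page-visit tool returning page text."""
    return f"[contents of {url}]"

TOOLS: dict[str, Callable[[str], str]] = {"search": search, "visit": visit}

# Scripted replies so the demo runs end to end; a real agent would prompt an
# LLM with the whole trajectory and parse its reply at each step.
SCRIPTED_STEPS = iter([
    "ACTION search: WebDancer WebWalker follow-up",
    "ACTION visit: https://example.org/webdancer",
    "FINAL: WebDancer is the follow-up to WebWalker.",
])

def call_llm(trajectory: str) -> str:
    """Stub policy model emitting 'ACTION tool: argument' or 'FINAL: answer'."""
    return next(SCRIPTED_STEPS)

def run_agent(question: str, max_steps: int = 8) -> str:
    trajectory = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(trajectory)
        if step.startswith("FINAL:"):
            return step.removeprefix("FINAL:").strip()
        # Parse 'ACTION tool: arg', run the tool, append the observation,
        # and let the model reason again over the grown trajectory.
        head, _, arg = step.partition(":")
        tool = TOOLS[head.removeprefix("ACTION").strip()]
        trajectory += f"{step}\nObservation: {tool(arg.strip())}\n"
    return "No answer within the step budget."

print(run_agent("What is WebDancer a follow-up to?"))
```

The essential point is that the model re-reads the entire accumulated trajectory at every step, so each tool observation can redirect the next action; training a policy to do this well is precisely what demands the deep multi-step data the article says shallow datasets cannot provide.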