Do Multimodal Large Models Understand Physical Tools? PhysToolBench Proposes a Benchmark for Measuring Multimodal Models' Understanding of Physical Tools
机器之心· 2025-11-04 08:52
Core Insights
- The article discusses the development of PhysToolBench, a benchmark designed to evaluate how well multimodal large models understand physical tools, and highlights the need for these models to improve at recognizing, understanding, and creating tools [2][22].

Summary by Sections

PhysToolBench Introduction
- PhysToolBench divides the understanding of physical tools into three levels: recognizing tools, understanding tools, and creating tools [2][5].
- The benchmark comprises over 1,000 image-text pairs in which a model must identify the appropriate tool for a given task from visual input [5]; a minimal sketch of such an evaluation loop appears after this summary.

Evaluation Criteria
- The evaluation covers 32 recent multimodal large models, including proprietary, open-source, and embodied-intelligence-specific models [7].
- The assessment is organized into three difficulty tiers: Easy (Tool Recognition), Medium (Tool Understanding), and Hard (Tool Creation) [8][6]; a hedged per-tier scoring sketch follows the evaluation-loop example below.

Model Performance
- The top-performing model, GPT-5, scored 62.15% overall, yet many models scored below 50% at the higher difficulty tiers, indicating a substantial gap relative to human performance [13].
- Proprietary models generally outperformed open-source models, and larger models showed stronger capabilities [13].

Specific Findings
- Models struggled to recognize and understand tools, particularly when judging whether a tool was usable, which poses potential safety risks [18].
- The research indicates that reasoning capabilities, especially visual-centric reasoning, are crucial for using physical tools effectively [19][22].

Future Directions
- The findings suggest that improving the understanding, application, and creation of complex physical tools is essential for progress toward general intelligence in AI [22].
- The article encourages further exploration of this area and provides links to the relevant paper, code, and dataset for interested readers [23].
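To make the benchmark protocol concrete, below is a minimal sketch of a PhysToolBench-style evaluation loop. The dataset layout, the field names (image_path, task, options, answer), and the query_model() helper are all hypothetical illustrations, not the authors' actual format; the real schema is defined in their released code and dataset.

```python
# Hypothetical sketch of evaluating a multimodal model on image-text
# pairs: the model sees a scene image and a task, and must name the
# appropriate tool among the candidates visible in the image.
import json

def query_model(image_path: str, prompt: str) -> str:
    """Placeholder for a call to the multimodal model under test.

    A real implementation would send the image plus the text prompt to
    the model's API and return its textual answer.
    """
    raise NotImplementedError

def evaluate(benchmark_file: str) -> float:
    """Return accuracy over a JSON file of image-text pairs (assumed format)."""
    with open(benchmark_file, encoding="utf-8") as f:
        samples = json.load(f)  # one entry per image-text pair

    correct = 0
    for sample in samples:
        prompt = (
            f"Task: {sample['task']}\n"
            f"Which tool in the image should be used? "
            f"Answer with one of: {', '.join(sample['options'])}."
        )
        prediction = query_model(sample["image_path"], prompt)
        # Lenient substring match; the paper's actual scoring rule may differ.
        if sample["answer"].lower() in prediction.lower():
            correct += 1
    return correct / len(samples)
```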
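Since results are reported per difficulty tier as well as overall, here is one plausible way such scores could be aggregated. Whether the paper micro-averages across samples (as below) or macro-averages across tiers is not stated in this summary, so treat this purely as an illustration.

```python
# Hypothetical per-tier scoring: group results by difficulty tier
# (easy / medium / hard) and report accuracy per tier plus an overall
# micro-average across all samples.
from collections import defaultdict

def tier_accuracies(results):
    """results: iterable of (tier, is_correct) pairs, with tier in
    {"easy", "medium", "hard"}. Returns (per-tier accuracy, overall accuracy)."""
    totals = defaultdict(lambda: [0, 0])  # tier -> [correct, seen]
    for tier, ok in results:
        totals[tier][0] += int(ok)
        totals[tier][1] += 1
    per_tier = {t: c / n for t, (c, n) in totals.items()}
    overall = (sum(c for c, _ in totals.values())
               / sum(n for _, n in totals.values()))
    return per_tier, overall
```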