Core Insights
- The article introduces Visual-ARFT (Visual Agentic Reinforcement Fine-Tuning), a training method designed to endow large vision-language models (LVLMs) with "tool agent" capabilities, enabling them to perform complex multimodal reasoning tasks [1][4][5].

Group 1: Visual-ARFT Overview
- Visual-ARFT allows models not only to interpret images but also to reason and act: they can execute Python code to read specific text regions in images and answer multimodal multi-hop questions via internet search [2][4].
- The method has been fully open-sourced, including training and evaluation code, data, and models, encouraging exploration of multimodal models, reinforcement learning, and visual language understanding [1][5].

Group 2: Core Capabilities
- The model demonstrates three core capabilities: Agentic Search, in which it analyzes visual information and retrieves external knowledge; Agentic Coding, in which it generates Python code for image-processing tasks; and multi-step reasoning [12][9].
- Visual-ARFT employs rule-based verifiable rewards to encourage the model to explore tool usage and reasoning patterns effectively [7].

Group 3: Evaluation and Performance
- The team developed MAT-Bench (Multimodal Agentic Tool Bench) to evaluate models' tool-calling and multimodal reasoning capabilities, filling a gap in the current evaluation landscape [9][12].
- Experimental results show that Visual-ARFT significantly outperforms GPT-4o on several sub-tasks, demonstrating strong potential for completing complex multimodal visual tasks [4][11].

Group 4: Performance Metrics
- On the MAT-Search and MAT-Coding benchmarks, Visual-ARFT achieved notable improvements over baseline models, with a clear performance advantage [13][11].
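The rule-based verifiable reward mentioned above typically combines a format check (does the response follow the expected tagged structure?) with an answer-accuracy signal such as token-level F1. The sketch below illustrates the general idea; the tag names, weighting, and metric choice are assumptions for illustration, not the paper's exact implementation.

```python
import re
from collections import Counter

def format_reward(response: str) -> float:
    """1.0 if the response follows the expected tagged structure
    (hypothetical <think>/<answer> tags; the actual format may differ)."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

def f1_reward(prediction: str, reference: str) -> float:
    """Token-level F1 between predicted and reference answers,
    a common verifiable accuracy signal for open-ended QA."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def total_reward(response: str, reference: str, w_fmt: float = 0.5) -> float:
    """Weighted sum of format and accuracy rewards (weights are assumed)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    acc = f1_reward(match.group(1).strip(), reference) if match else 0.0
    return w_fmt * format_reward(response) + (1 - w_fmt) * acc
```

Because both components are computed by deterministic rules rather than a learned reward model, the signal is cheap to evaluate and hard for the policy to exploit, which is the usual motivation for verifiable rewards in agentic RL fine-tuning.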
- The Qwen2.5-VL model, enhanced by Visual-ARFT, exhibited significant performance gains on traditional multi-hop QA benchmarks, showcasing its generalization capability despite limited training data [14].
Let vision-language models search and write code hands-on like o3! Visual-ARFT delivers multimodal agent capabilities
机器之心·2025-05-27 04:11