Reinforcement Fine-Tuning

Let vision-language models search and write code like o3! Visual ARFT delivers multimodal agent capabilities
机器之心· 2025-05-27 04:11
Core Insights
- The article discusses Visual-ARFT, a training method designed to endow large vision-language models (LVLMs) with "tool agent" capabilities, enabling them to perform complex multimodal reasoning tasks [1][4][5].

Group 1: Visual-ARFT Overview
- Visual-ARFT allows models not only to interpret images but also to reason and act, including executing Python code to read specific text regions in images and answering multimodal multi-hop questions via internet search [2][4].
- The method is fully open-sourced, including training and evaluation code, data, and models, encouraging exploration in multimodal models, reinforcement learning, and vision-language understanding [1][5].

Group 2: Core Capabilities
- The model demonstrates three core capabilities: Agentic Search, where it analyzes visual information and retrieves external knowledge; Agentic Coding, where it generates Python code for image-processing tasks (a sketch of such generated code appears after this summary); and multi-step reasoning [12][9].
- Visual-ARFT employs a rule-based verifiable reward system to encourage the model to explore tool usage and reasoning patterns effectively (see the reward sketch below) [7].

Group 3: Evaluation and Performance
- The team developed MAT-Bench (Multimodal Agentic Tool Bench) to evaluate models' tool-calling and multimodal reasoning capabilities, filling a gap in the current evaluation landscape [9][12].
- Experimental results show that Visual-ARFT significantly outperforms GPT-4o on several sub-tasks, demonstrating strong potential for completing complex multimodal visual tasks [4][11].

Group 4: Performance Metrics
- On the MAT-Search and MAT-Coding benchmarks, Visual-ARFT achieved notable improvements over baseline models, with a clear performance advantage [13][11].
- The Qwen2.5-VL model trained with Visual-ARFT also showed significant gains on traditional MultihopQA benchmarks, demonstrating generalization despite limited training data [14].
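The Agentic Coding capability described above has the model emit Python that manipulates an image (for example, cropping or rotating a region) before reading the text in it. The article does not reproduce the generated code; the following is a minimal sketch of what such tool code might look like, assuming a Pillow-based environment. The file name, crop box, and rotation angle are illustrative, not taken from the article.

```python
# Hypothetical example of the kind of image-manipulation code an agentic model
# might emit before re-reading a text region; values are illustrative assumptions.
from PIL import Image


def prepare_text_region(path: str, box: tuple[int, int, int, int], angle: float = 0.0) -> Image.Image:
    """Crop a region of interest and optionally rotate it upright so skewed text is easier to read."""
    img = Image.open(path)
    region = img.crop(box)  # box = (left, upper, right, lower)
    if angle:
        region = region.rotate(angle, expand=True, fillcolor="white")
    return region


if __name__ == "__main__":
    # Illustrative values only; the model would then "look at" roi.png and answer.
    roi = prepare_text_region("photo.jpg", box=(120, 340, 560, 420), angle=-15)
    roi.save("roi.png")
```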
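The summary also mentions a rule-based verifiable reward, but the article does not spell out the rules. Below is a minimal sketch of the common pattern such rewards follow (a format check on tagged reasoning plus an answer-match term); the tag names, fuzzy-match threshold, and weights are assumptions, not details from the paper.

```python
# Minimal sketch of a rule-based verifiable reward, assuming <think>/<answer>-style tags.
# Tag names, weights, and the match threshold are illustrative assumptions.
import re
from difflib import SequenceMatcher


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected <think>...</think><answer>...</answer> layout."""
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, completion, flags=re.DOTALL) else 0.0


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Fuzzy match between the extracted answer and the reference answer."""
    m = re.search(r"<answer>(.+?)</answer>", completion, flags=re.DOTALL)
    if not m:
        return 0.0
    pred = m.group(1).strip().lower()
    ratio = SequenceMatcher(None, pred, ground_truth.strip().lower()).ratio()
    return ratio if ratio >= 0.5 else 0.0


def verifiable_reward(completion: str, ground_truth: str) -> float:
    # Equal weighting of the two terms is an illustrative choice.
    return 0.5 * format_reward(completion) + 0.5 * accuracy_reward(completion, ground_truth)
```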
ChatGPT's Deep Research can now connect to GitHub! Netizens: this is real RAG
量子位· 2025-05-09 00:16
Core Viewpoint
- ChatGPT has introduced a new "Deep Research" connector that links directly to GitHub, allowing users to generate reports grounded in their code repositories [1][5].

Group 1: Deep Research Functionality
- The feature lets users request specific reports about a GitHub codebase, including project purpose, architecture, key modules, technology stack, and actionable code-quality improvement suggestions [1].
- Users connect GitHub to ChatGPT, which then analyzes the repository in real time and answers based on the user's queries [8][9].
- The feature is currently in testing and available to Team users globally, with plans to roll it out to Plus and Pro users [5].

Group 2: Interaction with GitHub
- Users can enter search terms in the "Search repos" box to find relevant repositories, and ChatGPT generates answers based on the connected GitHub repositories [2][3].
- When users ask questions, ChatGPT automatically generates search keywords to locate the most relevant code or files within the connected repositories (a retrieval sketch follows this summary) [11][12].
- OpenAI has clarified that for enterprise products, user content will not be used to improve models by default, while personal-tier users may have their content used if they opt in [14].

Group 3: Additional Features
- OpenAI has also launched Reinforcement Fine-Tuning (RFT), which improves model performance through chain-of-thought reasoning and task-specific scoring, and is particularly useful in complex domains (see the grader sketch below) [15].
- One example cited is AccordanceAI, which fine-tuned a model for tax and accounting and achieved top performance [15].
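The article says ChatGPT generates search keywords to find relevant files in a connected repository, but does not show how such retrieval works. The following is a minimal keyword-scoring sketch over a local repository checkout, purely illustrative; the function name, scoring scheme, and file-extension filter are assumptions, not OpenAI's implementation.

```python
# Illustrative keyword-based file retrieval over a local repository checkout.
# This is NOT OpenAI's implementation; paths, scoring, and extensions are assumptions.
from pathlib import Path


def search_repo(repo_root: str, keywords: list[str], top_k: int = 5,
                exts: tuple[str, ...] = (".py", ".md", ".toml")) -> list[tuple[str, int]]:
    """Rank files by how many keyword occurrences they contain."""
    scores: list[tuple[str, int]] = []
    for path in Path(repo_root).rglob("*"):
        if path.suffix not in exts or not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
        except OSError:
            continue
        score = sum(text.count(k.lower()) for k in keywords)
        if score:
            scores.append((str(path), score))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Example query a user might pose: "where is authentication handled?"
    print(search_repo(".", ["auth", "login", "token"]))
```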
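RFT as summarized above depends on task-specific scoring of model outputs. The article provides no grader code; below is a minimal sketch of a task-specific grader in the spirit of that description, loosely inspired by a tax/accounting-style numeric task. The tolerance rule and partial-credit scheme are assumptions, not OpenAI's grader API.

```python
# Minimal sketch of a task-specific grader for reinforcement fine-tuning.
# The tolerance-based numeric comparison is an illustrative assumption.
def grade_numeric_answer(model_output: str, reference: float, tolerance: float = 0.01) -> float:
    """Return a score in [0, 1]: full credit within tolerance, decaying partial credit otherwise."""
    try:
        # Take the last number the model produced as its final answer.
        tokens = [t.strip("$,%") for t in model_output.split()]
        numbers = [float(t) for t in tokens if t.replace(".", "", 1).replace("-", "", 1).isdigit()]
        predicted = numbers[-1]
    except (ValueError, IndexError):
        return 0.0
    rel_err = abs(predicted - reference) / max(abs(reference), 1e-9)
    if rel_err <= tolerance:
        return 1.0
    return max(0.0, 1.0 - rel_err)  # linearly decaying partial credit


# Example: a fine-tuning loop would use this score as the reward signal.
print(grade_numeric_answer("The total tax owed is 1234.50 dollars", 1234.56))
```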