Are RAG and Search Agents No Longer Enough? Apple's DeepMMSearch-R1 Enters the Multimodal Search Arena
Apple (US:AAPL) · 36Kr · 2025-10-17 02:44

Core Insights
- Apple has introduced a new model, DeepMMSearch-R1, which enhances multimodal large language models (MLLMs) for web search by enabling dynamic querying and self-correction across multi-round interactions [1][6].

Model Development
- DeepMMSearch-R1 addresses limitations of existing approaches such as retrieval-augmented generation (RAG) and prompt-based search agents, which often yield poor results at high cost because of rigid pipelines and excessive search calls [1][3].
- The model is trained in two stages: supervised fine-tuning (SFT) followed by online reinforcement learning (RL) with the Group-Relative Policy Optimization (GRPO) algorithm [3][5][10]; a sketch of the group-relative advantage computation is given at the end of this summary.

Dataset Creation
- Apple has created a new dataset, DeepMMSearchVQA, which contains diverse visual question-answering samples in multi-turn dialogue format with a balanced distribution across knowledge categories [3][7].
- The dataset comprises approximately 47,000 refined dialogue samples, derived from 200,000 randomly selected samples from the InfoSeek training set; quality is enforced by retaining only dialogues whose answers align with the predictions of the Gemini-2.5-Pro model (see the filtering sketch below) [7].

Search Process Integration
- The model integrates three tools: a text search tool for targeted queries, a Grounding DINO-based image localization tool for identifying the question-relevant region of an image, and an image search tool for retrieving web content based on input images [4][5]; a minimal tool-dispatch sketch follows at the end of this summary.
- This targeted search approach significantly improves retrieval quality and overall performance [3][4].

Performance Metrics
- DeepMMSearch-R1 shows significant gains over RAG workflows and prompt-based search agents, improving performance by +21.13% and +8.89%, respectively [13].
- Its performance is comparable to OpenAI's o3, indicating a competitive position [13].

Training Efficiency
- The SFT phase strengthens the language model's reasoning capabilities for web retrieval, while the RL phase optimizes tool-selection behavior by reducing unnecessary calls [16][17].
- The model maintains its general visual question-answering capabilities while learning to interact effectively with web search tools [19][20].
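
For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage step referenced above: rewards for a group of rollouts sampled from the same question are normalized against the group's mean and standard deviation. This is a minimal, standalone illustration; DeepMMSearch-R1's actual reward terms (answer correctness, tool-call penalties, formatting) are not detailed in this summary.

```python
# Sketch of the group-relative advantage used in GRPO-style training.
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within a group of rollouts for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts for one question; the rollout that answered correctly
# with fewer search calls would receive the highest reward, hence the highest advantage.
print(group_relative_advantages([1.0, 0.7, 0.0, 0.0]))
```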
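The dataset-quality step described above (keeping only dialogues that agree with Gemini-2.5-Pro) can be pictured as a simple filter over candidate dialogues. The function and field names below are hypothetical placeholders; the paper's exact pipeline may differ.

```python
# Hypothetical sketch of agreement-based filtering of candidate dialogues.
from typing import Dict, List

def filter_dialogues(dialogues: List[Dict], judge_answers: Dict[str, str]) -> List[Dict]:
    """Retain only dialogues whose final answer matches the judge model's answer."""
    kept = []
    for d in dialogues:
        judge = judge_answers.get(d["id"], "").strip().lower()
        if judge and d["predicted_answer"].strip().lower() == judge:
            kept.append(d)
    return kept
```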
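Finally, the tool-dispatch sketch referenced in the Search Process Integration section: per turn, the model either issues a targeted text query, crops the informative image region and runs an image search, or answers. All tool functions and the policy interface here are hypothetical placeholders (stubbed so the snippet runs); they are not Apple's actual API.

```python
# Minimal sketch of a three-tool, multi-turn search loop under the assumptions above.
from typing import Callable, Dict, List

def text_search(query: str) -> str:        # placeholder: call a web text-search API
    return f"[text results for: {query}]"

def image_search(image_crop) -> str:       # placeholder: reverse image search on the crop
    return "[image-search results]"

def crop_relevant_region(image, box):      # placeholder: Grounding-DINO-style localizer
    return image

def run_search_episode(policy: Callable[[List[Dict]], Dict], image, question,
                       max_turns: int = 5) -> str:
    """Each turn the policy returns an action dict, e.g.
    {"tool": "text_search", "query": "..."}, {"tool": "image_search", "box": (...)},
    or {"tool": "finish", "answer": "..."}."""
    context: List[Dict] = [{"role": "user", "image": image, "text": question}]
    for _ in range(max_turns):
        action = policy(context)
        if action["tool"] == "finish":
            return action["answer"]
        if action["tool"] == "text_search":
            obs = text_search(action["query"])
        else:  # "image_search"
            obs = image_search(crop_relevant_region(image, action.get("box")))
        context.append({"role": "tool", "name": action["tool"], "text": obs})
    return policy(context).get("answer", "")
```

The point of the loop is the one the article emphasizes: search calls are issued only when the policy decides they are needed, which is what the RL stage optimizes to avoid excessive calls.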