Are RAG and Search Agents Losing Their Appeal? Apple's DeepMMSearch-R1 Charges into the New Battlefield of Multimodal Search
36Kr · 2025-10-17 02:44
Core Insights
- Apple has introduced a new model called DeepMMSearch-R1, which enhances multimodal large language models (MLLMs) for web search by enabling dynamic querying and self-correction during multi-round interactions [1][6].

Model Development
- The DeepMMSearch-R1 model addresses limitations in existing methods like retrieval-augmented generation (RAG) and search agents, which often suffer from inefficiencies and poor results due to rigid processes and excessive search calls [1][3].
- The model employs a two-stage training process: supervised fine-tuning (SFT) followed by online reinforcement learning (RL) using the Group-Relative Policy Optimization (GRPO) algorithm [3][5][10].

Dataset Creation
- Apple has created a new dataset named DeepMMSearchVQA, which includes diverse visual question-answering samples presented in multi-turn dialogue format, ensuring a balanced distribution across different knowledge categories [3][7].
- The dataset consists of approximately 47,000 refined dialogue samples, derived from a random selection of 200,000 samples from the InfoSeek training set; quality is ensured by retaining only those dialogues that align with the predictions of the Gemini-2.5-Pro model [7] (a hedged filtering sketch follows this summary).

Search Process Integration
- The model integrates three tools: a text search tool for targeted queries, a Grounding DINO-based image localization tool for identifying relevant regions in images, and an image search tool for retrieving web content based on input images [4][5] (a hedged sketch of such a multi-round tool loop also follows this summary).
- This targeted search approach significantly improves retrieval quality and overall performance [3][4].

Performance Metrics
- The DeepMMSearch-R1 model shows significant improvements over RAG workflows and prompt-based search agents, with performance gains of +21.13% and +8.89%, respectively [13].
- The model's performance is comparable to OpenAI's o3, indicating its competitive edge [13].

Training Efficiency
- The SFT phase focuses on enhancing the language model's reasoning capabilities over web-retrieved information, while the RL phase optimizes tool-selection behavior by reducing unnecessary calls [16][17].
- The model maintains its general visual question-answering capabilities while learning to interact with web search tools effectively [19][20].
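The filtering step described under Dataset Creation (keeping only dialogues whose answers agree with the Gemini-2.5-Pro model's predictions) amounts to a simple consistency filter. Below is a minimal sketch under the assumption that agreement is judged by normalized string match; the record format, field names, and matching rule are illustrative assumptions, not the authors' actual criterion.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase and strip punctuation/extra whitespace for a lenient comparison."""
    return re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()

def filter_dialogues(dialogues, judge_predict):
    """Keep only dialogues whose reference answer matches the judge model's prediction.

    `dialogues` is a list of dicts with 'question', 'image', and 'answer' keys;
    `judge_predict` is a callable wrapping the judge model (e.g. Gemini-2.5-Pro).
    Both the record format and the matching rule are assumptions for illustration.
    """
    kept = []
    for d in dialogues:
        prediction = judge_predict(d["question"], d["image"])
        if normalize(prediction) == normalize(d["answer"]):
            kept.append(d)
    return kept

# Toy usage with a stub judge that always answers "Eiffel Tower".
sample = [{"question": "Which landmark is shown?", "image": None, "answer": "eiffel tower"}]
print(len(filter_dialogues(sample, lambda q, img: "Eiffel Tower")))  # -> 1
```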
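To make the multi-round search process concrete, here is a minimal sketch of an on-demand tool loop of the kind described under Search Process Integration. The tool-call tag format, the `model.generate` interface, and the stub tool implementations are assumptions for illustration; they are not taken from Apple's released code.

```python
import re

# Hypothetical stubs for the three tools; a real system would call a web search API,
# a Grounding DINO detector, and an image search backend.
def text_search(query: str) -> str:
    return f"[web snippets for: {query}]"

def crop_relevant_region(image, phrase: str):
    # Assumed Grounding DINO-style localization: return the sub-image most relevant to `phrase`.
    return image  # stub: pass the full image through

def image_search(image) -> str:
    return "[pages retrieved for the (cropped) query image]"

def answer_with_search(model, question: str, image, max_rounds: int = 4) -> str:
    """Run an on-demand, multi-round search loop until the model emits a final answer."""
    history = [("user", question)]
    for _ in range(max_rounds):
        reply = model.generate(history, image)  # assumed MLLM chat interface
        # Assumed tool-call format: <tool>argument</tool> tags emitted by the model.
        call = re.search(r"<(text_search|image_search|crop)>(.*?)</\1>", reply, re.S)
        if call is None:
            return reply                         # no tool call -> treat reply as the final answer
        tool, arg = call.group(1), call.group(2).strip()
        if tool == "text_search":
            observation = text_search(arg)
        elif tool == "crop":
            image = crop_relevant_region(image, arg)  # later image searches use the crop
            observation = "[image cropped to the region described]"
        else:  # image_search
            observation = image_search(image)
        history += [("assistant", reply), ("tool", observation)]
    return model.generate(history, image)        # force a final answer after the round budget
```

The loop stops as soon as the model answers without requesting a tool, which is what lets the trained policy skip unnecessary searches for questions it can answer from internal knowledge.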
Are RAG and Search Agents Losing Their Appeal? Apple's DeepMMSearch-R1 Charges into the New Battlefield of Multimodal Search
机器之心 · 2025-10-17 02:11
Core Insights
- Apple has introduced a new solution for empowering multimodal large language models (MLLMs) in multimodal web search, addressing inefficiencies in existing methods like retrieval-augmented generation (RAG) and search agents [1][5].

Group 1: Model Development
- The DeepMMSearch-R1 model allows for on-demand multi-round web searches and dynamically generates queries for text and image search tools, improving efficiency and results [1][3].
- A two-stage training process is employed, starting with supervised fine-tuning (SFT) followed by online reinforcement learning (RL) using the GRPO algorithm, aimed at optimizing search initiation and tool usage [3][4].

Group 2: Dataset Creation
- Apple has created a new dataset called DeepMMSearchVQA, which includes diverse multi-hop visual question-answering samples presented in multi-round dialogue format, balancing different knowledge categories [4][7].
- The dataset construction involved selecting 200,000 samples from the InfoSeek training set, resulting in approximately 47,000 refined dialogue samples for training [7].

Group 3: Training Process
- In the SFT phase, the Qwen2.5-VL-7B model is fine-tuned to enhance its reasoning over web-search information while keeping the visual encoder frozen [9] (see the LoRA fine-tuning sketch after this summary).
- The RL phase utilizes GRPO to improve training stability by comparing candidate responses generated under the same prompt, optimizing the model's tool-selection behavior [10][12] (see the advantage-computation sketch after this summary).

Group 4: Performance Results
- The DeepMMSearch-R1 model significantly outperforms RAG workflows and prompt-based search agents, achieving performance increases of +21.13% and +8.89%, respectively [16].
- The model's ability to perform targeted image searches and self-reflection enhances overall performance, as demonstrated in various experiments [16][18].

Group 5: Tool Utilization
- The model's tool-usage behavior aligns with dataset characteristics, with tool invocation rates of 87.7% on the DynVQA dataset and 43.5% on the OKVQA dataset [20].
- The RL model effectively corrects unnecessary tool usage observed in the SFT model, highlighting the importance of RL in optimizing tool efficiency [21].

Group 6: Generalization Capability
- The use of LoRA modules during SFT and a KL penalty during online GRPO training helps maintain the model's general visual question-answering capabilities across multiple datasets [23][24].
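Group 3 and Group 6 note that the SFT stage keeps the visual encoder frozen and trains LoRA adapters while the base weights stay fixed. A minimal sketch of that setup with Hugging Face `transformers` (a recent release is required for the Qwen2.5-VL class) and `peft` might look as follows; the checkpoint name, target module list, LoRA rank, and the "visual" name filter are assumptions, not the paper's exact configuration.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"   # assumed checkpoint name
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

# Freeze the visual encoder (assumption: its parameter names contain "visual").
for name, param in model.named_parameters():
    if "visual" in name:
        param.requires_grad = False

# Attach LoRA adapters to the language model's attention projections (assumed module names).
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA weights should be trainable
```

Keeping the visual encoder frozen and updating only low-rank adapters is what the summary credits, together with the KL penalty during RL, for preserving general VQA ability.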
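The GRPO step described in Group 3 and Group 6 can be illustrated with a small sketch of the group-relative advantage and a KL-penalized per-token objective. The reward scheme, clipping threshold, and KL coefficient below are placeholders for illustration, not the paper's settings.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean and std of its own group (responses to the same prompt)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, kl_coef=0.04):
    """Per-token clipped surrogate with a KL penalty toward the reference policy
    (the penalty is what the summary credits with preserving general VQA ability)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # k3 estimator of KL(new || ref), commonly used in GRPO implementations
    kl = math.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return surrogate - kl_coef * kl

# Toy usage: 4 candidate answers to one prompt, rewarded 1.0 if correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # -> [1.0, -1.0, -1.0, 1.0] (approximately)
```

Because advantages are computed within each group of candidates for the same prompt, no separate value model is needed, which is the stability benefit the summary attributes to GRPO.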