Multimodal Search

Are RAG and Search Agents Obsolete? Apple's DeepMMSearch-R1 Enters the Multimodal Search Battlefield
36Kr · 2025-10-17 02:44
Core Insights
- Apple has introduced DeepMMSearch-R1, a model that enhances multimodal large language models (MLLMs) for web search by enabling dynamic querying and self-correction across multi-round interactions [1][6].

Model Development
- DeepMMSearch-R1 addresses limitations of existing methods such as retrieval-augmented generation (RAG) and search agents, which often suffer from rigid pipelines and excessive search calls, leading to inefficiency and poor results [1][3].
- The model is trained in two stages: supervised fine-tuning (SFT) followed by online reinforcement learning (RL) using the Group-Relative Policy Optimization (GRPO) algorithm [3][5][10].

Dataset Creation
- Apple has created a new dataset, DeepMMSearchVQA, containing diverse visual question-answering samples in multi-turn dialogue format, with a balanced distribution across knowledge categories [3][7].
- The dataset comprises approximately 47,000 refined dialogue samples, derived from 200,000 randomly selected samples of the InfoSeek training set; quality is ensured by retaining only dialogues that align with the predictions of the Gemini-2.5-Pro model [7].

Search Process Integration
- The model integrates three tools: a text search tool for targeted queries, a Grounding DINO-based image localization tool for identifying relevant regions in images, and an image search tool for retrieving web content based on input images [4][5].
- This targeted search approach significantly improves retrieval quality and overall performance [3][4].

Performance Metrics
- DeepMMSearch-R1 shows significant gains over RAG workflows and prompt-based search agents, improving performance by +21.13% and +8.89%, respectively [13].
- The model's performance is comparable to OpenAI's o3, indicating a competitive edge [13].
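The multi-turn, three-tool search process described above can be illustrated with a minimal sketch. Everything below is an illustrative assumption: the stub search functions, the policy interface, and the control flow are not Apple's actual API, only a schematic of an on-demand tool loop in which search results are fed back into the context before the model answers.

```python
def text_search(query):
    """Stand-in for a web text-search call."""
    return f"text results for: {query}"

def image_search(image):
    """Stand-in for a reverse image-search call."""
    return f"pages matching image: {image}"

def ground_and_crop(image, phrase):
    """Stand-in for the Grounding DINO step: crop the visual entity
    named by `phrase` to suppress background noise before searching."""
    return f"{image}[crop:{phrase}]"

def run_episode(decide, question, image, max_turns=4):
    """Run a multi-turn episode: on each turn the policy either calls a
    tool (the result is appended to the context) or emits an answer."""
    context = [("question", question)]
    for _ in range(max_turns):
        action = decide(context, image)
        if action["tool"] == "answer":
            return action["arg"]
        if action["tool"] == "text_search":
            result = text_search(action["arg"])
        else:  # image search on a grounded crop of the input image
            result = image_search(ground_and_crop(image, action["arg"]))
        context.append(("tool_result", result))
    return "no answer"

def scripted_policy(context, image):
    """Toy policy: search on the first turn, then answer."""
    if len(context) == 1:
        return {"tool": "image_search", "arg": "landmark"}
    return {"tool": "answer", "arg": "final answer"}

print(run_episode(scripted_policy, "What landmark is this?", "photo.jpg"))
```

In the real model the `decide` step is the MLLM itself, which also performs the self-reflection and query refinement the article describes; the sketch only shows how tool results loop back into the dialogue context.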
Training Efficiency
- The SFT phase strengthens the language model's reasoning capabilities for web retrieval, while the RL phase optimizes tool-selection behavior by reducing unnecessary calls [16][17].
- The model retains its general visual question-answering capabilities while learning to interact with web search tools effectively [19][20].
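The GRPO step used in the RL phase can be illustrated with a minimal sketch of group-relative advantage computation. This follows the general GRPO recipe (score a group of sampled rollouts for the same prompt, then normalize each reward by the group mean and standard deviation), not Apple's specific implementation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its sampling group (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # all-equal group -> zero advantages
    return [(r - mean) / std for r in rewards]

# One prompt, a group of 4 sampled rollouts scored by the reward function
# (e.g. answer correctness minus a penalty for unnecessary tool calls).
rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed relative to the group rather than a learned value function, rollouts that answer correctly with fewer search calls get pushed up against their siblings, which is one plausible mechanism for the reduction in unnecessary tool calls noted above.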
Are RAG and Search Agents Obsolete? Apple's DeepMMSearch-R1 Enters the Multimodal Search Battlefield
机器之心 (Jiqizhixin) · 2025-10-17 02:11
机器之心 report. Editor: Du Wei.

Apple has been remarkably prolific lately! In recent days, Apple found a new way to empower multimodal large language models (MLLMs) in multimodal web search. In real-world applications, MLLMs need to access external knowledge sources and respond in real time to dynamically changing information in order to handle information retrieval and knowledge-intensive user queries. Current approaches, such as retrieval-augmented generation (RAG), search agents, and multimodal large models equipped with search, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, leading to inefficiency and unsatisfactory results.

To overcome the limitations exposed in prior work, Apple proposes the DeepMMSearch-R1 model. The model can perform multi-round web searches on demand and dynamically generate queries for both text and image search tools, as shown in Figure 1 (right). Specifically, DeepMMSearch-R1 adaptively generates and refines text search queries across multiple turns through self-reflection and self-correction, using retrieved content as feedback in combination with the original question.

To improve image search, Apple introduces an intermediate image cropping tool (Grounding DINO) to handle the challenges posed by background noise and distracting visual entities. In this process, DeepMMSearch-R1 first generates a referring expression for the visual entity most relevant to the question ...