Have RAG and Search Agents Lost Their Shine? Apple's DeepMMSearch-R1 Enters the New Battleground of Multimodal Search

Core Insights
- Apple has introduced a new approach to equipping multimodal large language models (MLLMs) for multimodal web search, addressing inefficiencies in existing methods such as retrieval-augmented generation (RAG) and search agents [1][5].

Group 1: Model Development
- The DeepMMSearch-R1 model performs on-demand, multi-round web searches, dynamically generating queries for text and image search tools to improve both efficiency and result quality [1][3]; a minimal sketch of such an agent loop appears at the end of this digest.
- Training is two-stage: supervised fine-tuning (SFT) followed by online reinforcement learning (RL) with the GRPO algorithm, optimizing when the model initiates a search and how it uses tools [3][4].

Group 2: Dataset Creation
- Apple built a new dataset, DeepMMSearchVQA, consisting of diverse multi-hop visual question-answering samples in a multi-round dialogue format, balanced across knowledge categories [4][7].
- Construction began with 200,000 samples selected from the InfoSeek training set and yielded roughly 47,000 refined dialogue samples for training [7].

Group 3: Training Process
- In the SFT phase, Qwen2.5-VL-7B is fine-tuned to strengthen its reasoning over web-search information while the visual encoder stays frozen [9] (see the SFT sketch below).
- The RL phase uses GRPO, which stabilizes training by comparing groups of candidate responses generated from the same prompt, to optimize the model's tool-selection behavior [10][12] (see the GRPO sketch below).

Group 4: Performance Results
- DeepMMSearch-R1 clearly outperforms RAG workflows and prompt-based search agents, by +21.13% and +8.89% respectively [16].
- The model's ability to perform targeted image searches and to self-reflect further improves performance, as demonstrated in various experiments [16][18].

Group 5: Tool Utilization
- Tool-usage behavior tracks dataset characteristics: the model invokes tools on 87.7% of DynVQA questions but only 43.5% of OKVQA questions [20].
- The RL-trained model corrects the unnecessary tool calls observed after SFT, underscoring RL's role in making tool use efficient [21].

Group 6: Generalization Capability
- LoRA modules during SFT and a KL penalty during online GRPO training help preserve the model's general visual question-answering capability across multiple datasets [23][24]; both pieces appear in the sketches below.
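To make the on-demand, multi-round behavior from Group 1 concrete, here is a minimal sketch of what such an agent loop could look like. The action schema, the `model.decide` call, the tool interfaces (`text_search`, `image_search`), and the round limit are all illustrative assumptions, not Apple's published implementation.

```python
# Hypothetical agent loop for on-demand, multi-round multimodal search.
# Action fields, tool names, and the message schema are assumptions.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "text_search", "image_search", or "answer"
    query: str = ""      # self-generated text query, if any
    crop: object = None  # optional cropped image region for image search
    text: str = ""       # final answer, if kind == "answer"

def run_search_agent(model, question, image, text_search, image_search,
                     max_rounds=5):
    """Let the model decide each round whether to search or answer."""
    history = [("user", question)]
    action = Action(kind="answer")
    for _ in range(max_rounds):
        action = model.decide(image=image, history=history)  # hypothetical API
        if action.kind == "text_search":
            # The model writes its own query (e.g., an entity it inferred).
            history.append(("tool", text_search(action.query)))
        elif action.kind == "image_search":
            # Reverse image search, optionally on a cropped region.
            history.append(("tool", image_search(action.crop or image)))
        else:
            return action.text  # confident enough to answer without more search
    return action.text  # search budget exhausted: return the latest answer
```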
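The SFT setup from Group 3 (visual encoder frozen) combined with the LoRA adapters mentioned in Group 6 could be wired up roughly as follows with Hugging Face transformers and peft. The rank, alpha, and target modules are placeholder values, not the paper's reported configuration.

```python
# Sketch: LoRA fine-tuning of Qwen2.5-VL-7B with a frozen vision tower.
# LoRA hyperparameters below are illustrative, not the paper's settings.
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct"
)

# Freeze every visual-encoder parameter so only the language side adapts.
for name, param in model.named_parameters():
    if "visual" in name:
        param.requires_grad = False

# Attach LoRA adapters to the language model's attention projections.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # the frozen vision tower contributes nothing
```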
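For the GRPO step from Group 3 and the KL penalty from Group 6, the core computation can be sketched as a single loss: rewards normalized within a group of responses to the same prompt serve as advantages, and a KL term pulls the policy toward a frozen reference model. The clip and KL coefficients and the KL estimator choice here are illustrative.

```python
# Sketch of a GRPO-style loss with a KL penalty toward a reference model.
# One prompt, a group of G sampled responses scored by a reward function.
import torch

def grpo_loss(logprobs, old_logprobs, ref_logprobs, rewards,
              clip_eps=0.2, kl_coef=0.04):
    """All inputs are (G,) tensors: summed token log-probs of each response
    under the current, behavior, and frozen reference policies, plus rewards."""
    # Group-relative advantage: normalize rewards within the group,
    # so no learned value function is required.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # PPO-style clipped surrogate objective on the probability ratio.
    ratio = torch.exp(logprobs - old_logprobs)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # KL estimator toward the reference policy: keeps the fine-tuned model
    # close to its pre-RL behavior, preserving general VQA skills.
    log_ratio_ref = ref_logprobs - logprobs
    kl = log_ratio_ref.exp() - log_ratio_ref - 1

    return (-surrogate + kl_coef * kl).mean()

# Toy usage: a group of 4 responses, two of which earned reward 1.
lp = torch.randn(4, requires_grad=True)
loss = grpo_loss(lp, lp.detach(), lp.detach() - 0.1,
                 torch.tensor([1.0, 0.0, 0.0, 1.0]))
loss.backward()
```

Because the advantage is normalized within each group of responses to the same prompt, GRPO needs no learned value network, which is where the stability benefit cited in Group 3 comes from.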