Next-generation object detection model: 3B-parameter MLLM Rex-Omni surpasses Grounding DINO for the first time, unifying 10+ vision tasks
机器之心· 2025-11-13 08:26
Core Insights
- The article discusses the breakthrough of the Rex-Omni model, which surpasses traditional coordinate-regression detectors in object localization accuracy, addressing long-standing criticisms of multimodal large language models (MLLMs) [2][4].

Group 1: Model Design and Innovations
- Rex-Omni integrates all visual perception tasks into a unified "next point prediction" framework, using an efficient 4-token coordinate encoding and a two-stage training pipeline that culminates in GRPO reinforcement learning (see the encoding sketch after this summary) [4][11].
- The model's output format combines quantized coordinates with special tokens, allowing various geometric outputs to be represented efficiently [13][14].
- Rex-Omni employs multiple data engines (Grounding, Referring, Pointing, and OCR) to generate high-quality training signals, enhancing its semantic understanding and spatial reasoning capabilities [16][17].

Group 2: Training Methodology
- The two-stage approach of SFT (supervised fine-tuning) followed by GRPO (Group Relative Policy Optimization) is crucial for achieving high localization accuracy and correcting behavioral deficiencies [19][21].
- GRPO introduces geometric reward functions, letting the model learn from its own generated sequences and significantly improving performance with minimal additional training steps (a minimal reward sketch follows the encoding example below) [19][21].

Group 3: Performance Evaluation
- In zero-shot evaluations on core detection benchmarks such as COCO and LVIS, Rex-Omni achieves an F1-score that surpasses traditional detectors like Grounding DINO [20][22].
- The model excels at dense and small-object detection, achieving the highest F1@mIoU among MLLMs and demonstrating fine-grained spatial localization [27][28].
- Its unified framework handles a wide range of visual perception tasks, outperforming traditional open-set detectors on referring object detection [31][34].

Group 4: Conclusion and Future Implications
- Rex-Omni represents a significant advance for MLLMs in visual perception, showing that they can overcome geometric and behavioral limitations to achieve precise geometric perception alongside robust language understanding [45].
- The model sets a new performance benchmark for MLLMs and points to a promising direction for next-generation object detection models [45].
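The summary does not spell out the 4-token coordinate encoding, but the idea it describes is standard: normalize pixel coordinates by image size and snap them to a fixed number of discrete bins, so a box (x0, y0, x1, y1) becomes exactly four vocabulary tokens the language model can emit. A minimal sketch follows; the bin count of 1000 is an assumed value, not one stated in the article.

```python
def quantize_box(box, img_w, img_h, num_bins=1000):
    """Map a pixel-space box (x0, y0, x1, y1) to 4 discrete coordinate tokens.

    num_bins=1000 is an assumed vocabulary size; the actual Rex-Omni
    bin count may differ.
    """
    x0, y0, x1, y1 = box

    def to_bin(v, size):
        # Normalize to [0, 1], then snap to one of num_bins discrete levels.
        return min(num_bins - 1, int(v / size * num_bins))

    return [to_bin(x0, img_w), to_bin(y0, img_h),
            to_bin(x1, img_w), to_bin(y1, img_h)]

def dequantize_box(tokens, img_w, img_h, num_bins=1000):
    """Invert the mapping, recovering approximate pixel coordinates
    (using bin centers)."""
    x0, y0, x1, y1 = tokens
    return [(x0 + 0.5) / num_bins * img_w, (y0 + 0.5) / num_bins * img_h,
            (x1 + 0.5) / num_bins * img_w, (y1 + 0.5) / num_bins * img_h]

# Example: a box in a 640x480 image becomes four small integers.
print(quantize_box((64, 48, 320, 240), 640, 480))  # [100, 100, 500, 500]
```

The payoff of this representation is that "detection" reduces to ordinary next-token prediction over a small coordinate vocabulary, which is what lets one decoder serve boxes, points, and other geometric outputs.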
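The summary mentions geometric reward functions in GRPO but gives no formula. A plausible minimal version scores each sampled output sequence by the IoU between its decoded boxes and the ground truth; the greedy best-match strategy below is an assumed simplification for illustration, not the paper's exact reward.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def geometric_reward(pred_boxes, gt_boxes):
    """Reward a sampled detection sequence by mean best-match IoU.

    Greedy per-prediction matching is an assumed simplification; a real
    implementation would likely use one-to-one (Hungarian) matching and
    penalize missed or duplicated objects.
    """
    if not pred_boxes or not gt_boxes:
        return 0.0
    return sum(max(iou(p, g) for g in gt_boxes)
               for p in pred_boxes) / len(pred_boxes)
```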
Are RAG and search agents losing their shine? Apple's DeepMMSearch-R1 charges into the new battleground of multimodal search
机器之心· 2025-10-17 02:11
Core Insights
- Apple has introduced a new solution for empowering multimodal large language models (MLLMs) with multimodal web search, addressing inefficiencies in existing methods such as retrieval-augmented generation (RAG) and search agents [1][5].

Group 1: Model Development
- The DeepMMSearch-R1 model performs on-demand, multi-round web searches and dynamically generates queries for text- and image-search tools, improving both efficiency and results (see the agent-loop sketch after this summary) [1][3].
- A two-stage training process is employed: supervised fine-tuning (SFT) followed by online reinforcement learning (RL) with the GRPO algorithm, aimed at optimizing when to initiate searches and how to use tools [3][4].

Group 2: Dataset Creation
- Apple created a new dataset, DeepMMSearchVQA, containing diverse multi-hop visual question-answering samples in multi-round dialogue format, balanced across knowledge categories [4][7].
- Construction started from 200,000 samples of the InfoSeek training set and yielded approximately 47,000 refined dialogue samples for training [7].

Group 3: Training Process
- In the SFT phase, the Qwen2.5-VL-7B model is fine-tuned to strengthen its reasoning over web-search information while the visual encoder is kept frozen [9].
- The RL phase uses GRPO, which improves training stability by comparing candidate responses generated under the same prompt, to optimize the model's tool-selection behavior (a simplified objective follows the agent-loop sketch below) [10][12].

Group 4: Performance Results
- DeepMMSearch-R1 significantly outperforms RAG workflows and prompt-based search agents, with gains of +21.13% and +8.89% respectively [16].
- The model's ability to perform targeted image searches and to self-reflect further improves overall performance, as demonstrated across experiments [16][18].

Group 5: Tool Utilization
- Tool-usage behavior tracks dataset characteristics: tools are invoked on 87.7% of DynVQA samples but only 43.5% of OKVQA samples [20].
- The RL model corrects the unnecessary tool calls observed in the SFT model, underscoring the importance of RL for optimizing tool efficiency [21].

Group 6: Generalization Capability
- Using LoRA modules during SFT and a KL penalty during online GRPO training preserves the model's general visual question-answering ability across multiple datasets [23][24].
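To make the on-demand, multi-round behavior concrete, here is a minimal sketch of such an agent loop. The function names (`mllm_generate`, `text_search`, `image_search`) and the action schema are hypothetical stand-ins for the fine-tuned policy and web tools; they are not Apple's actual API.

```python
# Hypothetical stand-ins for the model and tools the summary describes.
def mllm_generate(question, image, context, force_answer=False):
    ...  # the fine-tuned Qwen2.5-VL-7B would decide the next action here

def text_search(query): ...   # text web-search tool
def image_search(img): ...    # image web-search tool

def answer_with_search(question, image, max_rounds=4):
    """On-demand multi-round search: the model itself decides, each round,
    whether to answer directly or to call a search tool."""
    context = []
    for _ in range(max_rounds):
        step = mllm_generate(question, image, context)
        if step["action"] == "answer":
            # The model judged that no (further) search is needed.
            return step["text"]
        if step["action"] == "text_search":
            # The model writes its own text query on demand.
            context.append(text_search(step["query"]))
        elif step["action"] == "image_search":
            # Targeted image search, e.g. on a model-selected crop.
            context.append(image_search(step.get("crop", image)))
    # Round budget exhausted: force a final answer from gathered context.
    return mllm_generate(question, image, context, force_answer=True)["text"]
```

The key contrast with a fixed RAG pipeline is that retrieval is conditional: for a question like OKVQA's commonsense queries the loop can exit on the first iteration with no tool call at all, which matches the invocation-rate gap reported in Group 5.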
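The stability mechanism attributed to GRPO, comparing candidate responses generated under the same prompt, amounts to normalizing rewards within each sampled group, with a KL penalty keeping the policy near a reference model (the property Group 6 credits for preserved general VQA ability). Below is a heavily simplified sequence-level sketch; real GRPO works token-level with ratio clipping, and `beta` is an assumed coefficient.

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages: normalize rewards across the G candidate
    responses sampled for one prompt."""
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

def grpo_loss(logp_new, logp_old, logp_ref, advantages, beta=0.04):
    """Simplified sequence-level GRPO objective with a KL penalty toward a
    frozen reference policy. Clipping and per-token credit assignment are
    omitted for brevity; beta is an assumed value."""
    ratio = torch.exp(logp_new - logp_old)      # importance ratio per sequence
    policy_term = -(ratio * advantages).mean()  # maximize advantage-weighted ratio
    kl_term = (logp_new - logp_ref).mean()      # crude KL estimate vs. reference
    return policy_term + beta * kl_term

# Example: four sampled answers to one prompt, rewarded 0/1 for correctness.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # above-mean answers get positive advantage
```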