An object detection model that can "think" is here! IDEA introduces Rex-Thinker: a chain-of-thought referring-object detection model with breakthroughs in both accuracy and interpretability
机器之心·2025-06-30 10:23

Core Insights
- The article introduces Rex-Thinker, a new solution from IDEA that incorporates logical reasoning chains into visual referring tasks, significantly improving the AI's ability to understand and locate objects through human-like reasoning [2][5].

Group 1: Innovation and Methodology
- Rex-Thinker builds an interpretable reasoning framework with three main steps: Planning, Action, and Summarization, allowing the model to break a language instruction down into actionable sub-steps [5][10].
- The model adopts a retrieval-based detection strategy: an open-vocabulary detector first generates candidate boxes, then the model reasons over each candidate to produce a structured output [9][10].
- The final output is standardized as JSON, improving the interpretability and reliability of the reasoning process [10].

Group 2: Training and Data
- The HumanRef-CoT dataset augments the existing HumanRef dataset with 90,000 chain-of-thought reasoning examples generated by GPT-4o, establishing a foundation for training models with reasoning capabilities [12][14].
- Training proceeds in two phases: supervised fine-tuning (SFT) on HumanRef-CoT, followed by GRPO-based reinforcement learning, which enhances reasoning quality and robustness [16][19].

Group 3: Performance and Results
- On the HumanRef Benchmark, introducing CoT supervision raised the average DF1 score by 0.9 points and improved the rejection score by a notable 13.8 percentage points [21].
- On the RefCOCOg dataset, Rex-Thinker showed strong transfer capability, achieving competitive performance without targeted fine-tuning; light GRPO adjustment validated further gains [22].
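The retrieval-then-reason pipeline above, with its JSON-standardized Planning → Action → Summarization output, can be sketched as follows. The field names (`plan`, `actions`, `summary`, `boxes`) and the holding-a-cup example are illustrative assumptions, not the model's documented schema.

```python
import json

def reasoning_trace(plan, candidate_results, answer_boxes):
    """Assemble a structured trace in the spirit of Rex-Thinker's
    Planning -> Action -> Summarization steps. All field names are
    hypothetical, chosen only to illustrate the JSON output format."""
    record = {
        "plan": plan,                  # sub-goals decomposed from the instruction
        "actions": candidate_results,  # per-candidate-box verification steps
        "summary": {
            "matched": len(answer_boxes) > 0,
            "boxes": answer_boxes,     # final boxes, or [] to decline prediction
        },
    }
    return json.dumps(record, ensure_ascii=False)

# Example: two candidate boxes from an open-vocabulary detector;
# only one passes every condition derived from the instruction.
out = reasoning_trace(
    plan=["find persons", "check who is holding a cup"],
    candidate_results=[
        {"box_id": 0, "checks": {"is_person": True, "holding_cup": False}, "keep": False},
        {"box_id": 1, "checks": {"is_person": True, "holding_cup": True}, "keep": True},
    ],
    answer_boxes=[[320, 110, 480, 600]],
)
parsed = json.loads(out)
```

Because the trace is plain JSON, a downstream consumer can inspect each verification step, or detect a declined prediction from an empty `boxes` list.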
Group 4: Visualization and Interpretability
- The article visualizes Rex-Thinker's reasoning process, showing how the model verifies each condition step by step and either outputs a result or declines to predict, highlighting its clear reasoning path and interpretability [24].
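The GRPO phase mentioned in Group 2 rests on group-relative advantage estimation: several reasoning rollouts are sampled per query, and each rollout's reward is normalized against its own group's mean and standard deviation. A minimal sketch of that core step, with placeholder rewards standing in for whatever box-accuracy or format scores the training actually uses:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages (GRPO's key step): normalize each
    sampled rollout's reward by the mean/std of its own group.
    The reward values themselves are placeholders here, not the
    paper's actual reward design."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled reasoning rollouts for one referring expression:
# a fully correct one, a wrong one, and two partially correct ones.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Rollouts above the group average get positive advantages and are reinforced; those below get negative ones, without needing a separate learned value model.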