AAAI 2026 Oral | LENS: A Large Segmentation Model Built on Unified Reinforced Reasoning
机器之心·2025-12-29 04:44

**Core Insights**

The article discusses the LENS framework, which aims to overcome the limitations of traditional supervised fine-tuning (SFT) in text-prompted image segmentation by coupling the reasoning and segmentation processes through reinforcement learning [2][3][9].

**Group 1: LENS Framework Overview**

- LENS introduces an end-to-end reinforcement learning mechanism that couples high-level reasoning with pixel-level execution, improving the model's robustness and generalization on complex tasks [3][9].
- The framework addresses two key weaknesses of prior segmentation models: limited generalization to unseen prompts, and the hidden information bottleneck between the reasoning and segmentation stages [6][9].

**Group 2: Core Components of LENS**

The architecture consists of three main components:

1. Multimodal Large Language Model (MLLM): acts as the reasoning core, generating a chain of thought and an initial bounding-box prediction from the input image and text instruction [12][13].
2. Context Module: serves as an information bridge, transforming the reasoning output into a form the segmentation model can consume [12][14].
3. Segmentation Model (SAM-2): produces the precise pixel-level mask from the information passed through the context module [13][14].

**Group 3: Performance Evaluation**

- LENS achieved state-of-the-art results in text-prompted segmentation, with an average cIoU of 81.2% on the RefCOCO benchmark and 78.3% on the more challenging GroundingSuite-Eval, outperforming the second-best method by nearly 10% [18][19].
- The framework's unified reinforcement learning reward improves reasoning and segmentation quality jointly, enabling the model to self-correct even from imperfect initial predictions [16][17].
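The three-stage architecture described in Group 2 can be sketched as a simple pipeline. This is a minimal illustrative sketch, not the released LENS API: the component interfaces, the `ReasoningOutput` structure, and the function name `run_lens` are all assumptions made for clarity.

```python
from dataclasses import dataclass


@dataclass
class ReasoningOutput:
    """Assumed shape of the MLLM's output: a chain of thought
    plus a coarse initial localization (hypothetical)."""
    chain_of_thought: str
    box: tuple  # (x1, y1, x2, y2) initial bounding-box prediction


def run_lens(image, instruction, mllm, context_module, segmenter):
    """Hypothetical sketch of the LENS pipeline:
    MLLM reasoning -> context module bridge -> SAM-2 mask decoding."""
    reasoning = mllm(image, instruction)      # chain of thought + coarse box
    prompt = context_module(reasoning)        # bridge reasoning to the segmenter
    mask = segmenter(image, prompt)           # pixel-level mask from SAM-2
    return reasoning.chain_of_thought, mask
```

The point of the middle stage is that the segmenter never sees raw text: the context module converts the reasoning output into a prompt representation, which is the "information bridge" the article refers to.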
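The unified reward mentioned in Group 3 scores reasoning and segmentation together rather than supervising them separately. A minimal sketch of such a combined reward, assuming illustrative components and weights (the actual reward terms and weighting in LENS are not specified here):

```python
def unified_reward(box_iou, mask_iou, format_ok, w_box=0.5, w_mask=0.5):
    """Hypothetical unified RL reward: gate on output format validity,
    then blend box-level (reasoning) and mask-level (segmentation)
    quality. Weights w_box/w_mask are illustrative assumptions."""
    if not format_ok:
        return 0.0  # malformed outputs earn nothing
    return w_box * box_iou + w_mask * mask_iou
```

Because the reward depends on the final mask as well as the intermediate box, gradient signal flows back to the reasoning stage, which is how an imperfect initial box can still be corrected downstream.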
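The cIoU figures quoted above are cumulative IoU: total intersection over total union summed across the whole evaluation set, the standard metric on RefCOCO-style benchmarks. A short sketch of how it is computed over binary masks:

```python
import numpy as np


def ciou(pred_masks, gt_masks):
    """Cumulative IoU: sum intersections and unions over the whole
    dataset before dividing, so larger objects weigh more than in
    a per-sample mean IoU."""
    inter, union = 0, 0
    for p, g in zip(pred_masks, gt_masks):
        p, g = p.astype(bool), g.astype(bool)
        inter += np.logical_and(p, g).sum()
        union += np.logical_or(p, g).sum()
    return inter / union
```

Note that cIoU generally differs from the mean of per-image IoUs: for two samples with IoUs 0.5 and 1.0, the mean is 0.75, while the cumulative value depends on the pixel counts involved.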