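X-SAM: From "Segment Anything" to "Any Segmentation": A Unified Multimodal Large Language Model for Image Segmentation, Achieving SoTA on 20+ Image Segmentation Datasets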

Core Viewpoint
- The article discusses the development of X-SAM, a unified multimodal large language model for image segmentation, which enhances the capabilities of existing models by allowing for pixel-level understanding and interaction through visual prompts [4][26].

Background and Motivation
- Segment Anything Model (SAM) excels in dense segmentation mask generation but is limited by its reliance on single input modes, hindering its applicability across various segmentation tasks [4].
- Multimodal large language models (MLLMs) have shown promise in tasks like image description and visual question answering but are fundamentally restricted in handling pixel-level visual tasks, which limits the development of generalized models [4].

Method Design
- X-SAM introduces a unified framework that extends the segmentation paradigm from "segment anything" to "any segmentation" by incorporating visual grounded segmentation (VGS) tasks [4].
- The model employs a dual projectors architecture to enhance image understanding and a segmentation connector to provide rich multi-scale information for segmentation tasks [11][12].
- X-SAM utilizes a three-stage progressive training strategy to optimize performance across diverse image segmentation tasks, including segmentor fine-tuning, alignment pre-training, and mixed fine-tuning [16][22].

Experimental Results
- X-SAM has been evaluated on over 20 segmentation datasets, achieving state-of-the-art performance across seven different image segmentation tasks [19].
- The model's performance metrics indicate significant improvements in various segmentation tasks compared to existing models, showcasing its versatility and effectiveness [20][21].

Summary and Outlook
- X-SAM represents a significant advancement in the field of image segmentation, establishing a foundation for future research in video segmentation and the integration of temporal information [26].
- Future directions include expanding the model's capabilities to video segmentation tasks, potentially enhancing video understanding technologies [26].
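
Illustrative sketch
- The three-stage progressive training strategy named under Method Design (segmentor fine-tuning, alignment pre-training, mixed fine-tuning) can be pictured as a schedule that decides which components are trainable in each stage. The sketch below is only an illustration under stated assumptions: the component names, the `Stage` class, and especially the choice of which components are unfrozen per stage are hypothetical and are not taken from the article or from the X-SAM code.

```python
from dataclasses import dataclass

# Hypothetical component names. The article mentions a segmentor, dual
# projectors, and a segmentation connector; "llm" stands in for the
# language model. The grouping is an assumption for illustration.
COMPONENTS = ("segmentor", "dual_projectors", "segmentation_connector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components whose weights are updated in this stage


# Stage names follow the article; the trainable sets are assumed, not sourced.
STAGES = (
    Stage("segmentor_fine_tuning", frozenset({"segmentor"})),
    Stage("alignment_pre_training", frozenset({"dual_projectors"})),
    Stage("mixed_fine_tuning", frozenset(COMPONENTS)),
)


def freeze_plan(stage):
    """Map each component to True (trainable) or False (frozen) for a stage."""
    return {component: component in stage.trainable for component in COMPONENTS}


def run_schedule():
    """Return the ordered list of (stage name, freeze plan) pairs."""
    return [(stage.name, freeze_plan(stage)) for stage in STAGES]


if __name__ == "__main__":
    for name, plan in run_schedule():
        unfrozen = [c for c, on in plan.items() if on]
        print(f"{name}: train {unfrozen}")
```

- In a real training loop, each freeze plan would typically be applied by toggling gradient tracking on the matching modules before the stage starts; how X-SAM actually does this is not described in the article.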