Core Insights
- The article discusses X-SAM, a unified multimodal large language model for image segmentation that extends existing models with pixel-level understanding and interaction through visual prompts [3][24].

Background and Motivation
- The Segment Anything Model (SAM) is limited by its reliance on a single input mode for visual prompts, which restricts its applicability across diverse image segmentation tasks [3].
- Multimodal large language models (MLLMs) excel at tasks such as image description and visual question answering, but they cannot directly handle pixel-level visual tasks, hindering the development of generalist models [3].

Method Design
- X-SAM introduces a universal input format and a unified output representation, supporting visual prompts in the form of points, scribbles, bounding boxes, and masks (a prompt-encoding sketch follows this summary) [6][7].
- The architecture includes dual encoders with paired projectors for image understanding, a segmentation connector that supplies fine-grained multi-scale features, and a unified segmentation decoder that replaces the original SAM decoder (a forward-pass sketch follows this summary) [10][11][12].

Training Strategy
- X-SAM employs a three-stage progressive training strategy, consisting of segmentor fine-tuning, alignment pre-training, and mixed fine-tuning, to optimize performance across diverse image segmentation tasks (a staged-training sketch follows this summary) [13][19].
- Training incorporates a dataset-balancing resampling strategy to improve performance on underrepresented datasets (a resampling sketch follows this summary) [15].

Experimental Results
- X-SAM was evaluated on more than 20 segmentation datasets and achieves state-of-the-art performance across seven image segmentation tasks [16].
- Reported metrics show X-SAM outperforming existing models in general segmentation, referring segmentation, and interactive segmentation [17][18].

Summary and Outlook
- X-SAM marks a significant advance in image segmentation, moving from "segment anything" to "any segmentation" through its task design and unified architecture [24].
- Future research directions include extending the model to video segmentation and integrating temporal information for stronger video understanding [25].
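To make the universal input format concrete, below is a minimal sketch of how heterogeneous visual prompts (points, scribbles, boxes, masks) can be normalized into a single dense representation before encoding. The schema and function names (`VisualPrompt`, `prompt_to_mask`) are hypothetical illustrations, not taken from the X-SAM codebase.

```python
from dataclasses import dataclass
from typing import Literal

import numpy as np


@dataclass
class VisualPrompt:
    """One user prompt, normalized to a common form (hypothetical schema)."""
    kind: Literal["point", "scribble", "box", "mask"]
    data: np.ndarray  # points: (N, 2); scribble/mask: (H, W); box: (4,)


def prompt_to_mask(prompt: VisualPrompt, h: int, w: int, radius: int = 5) -> np.ndarray:
    """Rasterize any prompt type into a binary (H, W) mask so downstream
    modules see one input format regardless of how the user prompted."""
    mask = np.zeros((h, w), dtype=bool)
    if prompt.kind == "point":
        yy, xx = np.ogrid[:h, :w]
        for x, y in prompt.data.astype(int):
            # Dilate each clicked point into a small disk.
            mask |= (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
    elif prompt.kind == "box":
        x0, y0, x1, y1 = prompt.data.astype(int)
        mask[y0:y1, x0:x1] = True
    else:  # scribble or mask: already a dense (H, W) array
        mask |= prompt.data.astype(bool)
    return mask
```

Collapsing all prompt types to a dense mask is one simple way to realize a "universal input format"; an embedding-based encoding would serve the same role.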
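The dual-encoder design described above can be sketched as a forward pass: one encoder provides semantic tokens for the LLM, the other provides fine-grained features that the connector routes to a unified mask decoder. This is a schematic of the described data flow only; all module names, dimensions, and the number of feature scales are placeholder assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class XSAMSketch(nn.Module):
    """Schematic of the described architecture; every submodule is a
    placeholder passed in by the caller, not X-SAM's actual code."""

    def __init__(self, img_enc, seg_enc, llm, mask_decoder, d_llm=4096, d_seg=256):
        super().__init__()
        self.img_enc = img_enc            # semantic features for the LLM
        self.seg_enc = seg_enc            # fine-grained features for masks
        self.img_proj = nn.Linear(1024, d_llm)  # projector into LLM space
        self.seg_proj = nn.Linear(1024, d_llm)  # second projector (dual design)
        # "Segmentation connector": carries multi-scale features to the decoder.
        self.connector = nn.ModuleList(nn.Conv2d(1024, d_seg, 1) for _ in range(3))
        self.llm = llm
        self.mask_decoder = mask_decoder  # unified decoder replacing SAM's

    def forward(self, image, text_tokens):
        sem = self.img_proj(self.img_enc(image))        # (B, N, d_llm)
        seg_feats = self.seg_enc(image)                 # list of (B, 1024, H, W)
        seg_tok = self.seg_proj(seg_feats[-1].flatten(2).transpose(1, 2))
        hidden = self.llm(torch.cat([sem, seg_tok], dim=1), text_tokens)
        multi_scale = [c(f) for c, f in zip(self.connector, seg_feats)]
        # The decoder conditions mask prediction on the LLM's hidden states.
        return self.mask_decoder(multi_scale, hidden)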
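A hedged illustration of the three-stage progressive schedule, reusing the placeholder module names from the previous sketch: each stage freezes the whole model, then unfreezes only the components that stage trains. The stage-to-module mapping below is an assumption for illustration; the article does not spell out which parameters each stage updates.

```python
import torch.nn as nn

# Hypothetical schedule: which parameter groups train in each stage.
STAGES = {
    "1_segmentor_finetune": ["seg_enc", "connector", "mask_decoder"],
    "2_alignment_pretrain": ["img_proj", "seg_proj"],
    "3_mixed_finetune":     ["img_proj", "seg_proj", "llm",
                             "connector", "mask_decoder"],
}


def configure_stage(model: nn.Module, stage: str) -> None:
    """Freeze everything, then unfreeze only this stage's groups."""
    for p in model.parameters():
        p.requires_grad = False
    for name, module in model.named_children():
        if name in STAGES[stage]:
            for p in module.parameters():
                p.requires_grad = True
```

Called once before each stage (e.g. `configure_stage(model, "2_alignment_pretrain")`), this keeps earlier-stage weights intact while later stages progressively widen the trainable set.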
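The dataset-balancing resampling can be illustrated with standard temperature-style reweighting, in which small datasets are sampled more often than their raw size would dictate. The exponent and dataset names below are illustrative; the article does not specify X-SAM's exact formula.

```python
import random


def balanced_sampling_weights(dataset_sizes: dict[str, int],
                              alpha: float = 0.5) -> dict[str, float]:
    """Per-dataset sampling weight w_i proportional to |D_i| ** alpha.
    alpha < 1 upweights small datasets; alpha = 1 recovers plain
    size-proportional sampling. X-SAM's exact rule may differ."""
    raw = {name: size ** alpha for name, size in dataset_sizes.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


# Example: choose which dataset the next training batch is drawn from.
weights = balanced_sampling_weights({"coco": 118_000, "refcoco": 17_000, "gres": 8_000})
names, probs = zip(*weights.items())
next_dataset = random.choices(names, weights=probs, k=1)[0]
```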
X-SAM: A Unified Multimodal Large Model for Image Segmentation, SoTA Across 20+ Datasets
具身智能之心·2025-08-21 00:03