Breaking SAM's Limitations! Meituan Proposes X-SAM: A Unified Framework Sweeping 20+ Segmentation Benchmarks
自动驾驶之心·2025-08-12 23:33

Core Insights
- The article introduces X-SAM, a segmentation framework from Meituan that overcomes the limitations of the Segment Anything Model (SAM) by handling multiple segmentation tasks in a single model and integrating multi-modal capabilities [3][4][5].

Group 1: Limitations of SAM
- SAM was initially seen as a universal solution for visual segmentation, but it has significant limitations: it is restricted to a single task, cannot understand text instructions, and requires a separate model for each task, which is inefficient [5][6][7].

Group 2: Innovations of X-SAM
- X-SAM integrates SAM's visual segmentation capabilities with the multi-modal understanding of large language models (LLMs) through a unified input format, a dual-encoder architecture, and multi-stage training [12][13][21].
- The unified input format allows diverse segmentation tasks to be expressed in a consistent way, so the model can interpret both text and visual prompts [13][15].
- The dual-encoder architecture pairs a global image encoder for overall scene understanding with a segmentation encoder for pixel-level detail [14][19]; a toy sketch of this design appears after this summary.
- Multi-stage training first fine-tunes the segmentation model, then aligns visual and language features, and finally applies mixed fine-tuning across diverse datasets to improve generalization [21][23]; a staged-training sketch is also given below.

Group 3: Performance Metrics
- X-SAM demonstrates superior performance across more than 20 datasets and 7 core tasks, achieving state-of-the-art results on a range of segmentation benchmarks [27][28].
- On COCO, X-SAM reached a panoptic quality (PQ) of 54.7, closely trailing the best-performing specialist model, Mask2Former [31].
- For open-vocabulary segmentation, X-SAM's average precision (AP) reached 16.2, significantly outperforming other models [31].
- On referring segmentation, X-SAM achieved cumulative Intersection over Union (cIoU) scores of 85.1, 78.0, and 83.8 across three datasets, surpassing competing methods [32]; the cIoU metric is sketched below.

Group 4: New Task Introduction
- X-SAM introduces a new task, Visual Grounding Detection (VGD) segmentation, in which the model segments all instances of a class indicated by a visual prompt, even when the prompt comes from a different image [25][26][35]; a usage sketch follows the other examples below.
- In experiments, X-SAM achieved average precision scores of 47.9 to 49.7 on VGD segmentation, significantly exceeding existing models [35].

Group 5: Future Directions
- The research team plans to extend X-SAM to video segmentation and dynamic scenes, aiming to strengthen its application in temporal visual understanding [43].
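
To make the dual-encoder design in Group 2 concrete, here is a minimal PyTorch sketch. The module names, dimensions, toy convolutional encoders, and the small transformer standing in for the LLM are all assumptions for illustration, not X-SAM's actual implementation; the sketch only shows the flow the summary describes: a global encoder feeds visual tokens to a language model alongside unified prompt tokens, while a separate segmentation encoder supplies pixel-level features for mask decoding.

```python
# Minimal sketch of a dual-encoder design as described in the summary.
# All module names, dimensions, and the toy encoders are assumptions for
# illustration; they do not reproduce the actual X-SAM implementation.
import torch
import torch.nn as nn

class DualEncoderSegmenter(nn.Module):
    def __init__(self, llm_dim=512, seg_dim=256):
        super().__init__()
        # Global image encoder: captures scene-level context for the LLM.
        self.global_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, llm_dim, 4, stride=4),
        )
        # Segmentation encoder: preserves pixel-level detail for mask decoding.
        self.seg_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 2, stride=2), nn.GELU(),
            nn.Conv2d(64, seg_dim, 2, stride=2),
        )
        # Projects global visual tokens into the LLM embedding space.
        self.projector = nn.Linear(llm_dim, llm_dim)
        # Stand-in for the LLM that fuses prompt tokens with visual tokens.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Maps LLM output queries into the mask-decoding feature space.
        self.query_proj = nn.Linear(llm_dim, seg_dim)

    def forward(self, image, prompt_tokens):
        # image: (B, 3, H, W); prompt_tokens: (B, T, llm_dim) unified prompt
        # embeddings (text instructions and/or encoded visual prompts).
        g = self.global_encoder(image)                   # (B, C, h, w)
        g = g.flatten(2).transpose(1, 2)                 # (B, h*w, C) visual tokens
        fused = self.llm(torch.cat([self.projector(g), prompt_tokens], dim=1))
        queries = self.query_proj(fused[:, -prompt_tokens.size(1):])  # (B, T, seg_dim)
        fine = self.seg_encoder(image)                   # (B, seg_dim, H/4, W/4)
        # Dot-product mask prediction: one mask per prompt-derived query.
        masks = torch.einsum("btc,bchw->bthw", queries, fine)
        return masks

# Toy forward pass with a 64x64 image and 4 prompt tokens.
model = DualEncoderSegmenter()
masks = model(torch.randn(1, 3, 64, 64), torch.randn(1, 4, 512))
print(masks.shape)  # torch.Size([1, 4, 16, 16])
```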
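
The multi-stage training in Group 2 can be pictured as successive phases that freeze and unfreeze parts of the model. The stage boundaries, trainable splits, learning rates, and dataset placeholders below are assumptions based on the summary (segmentation fine-tuning, vision-language alignment, mixed fine-tuning), not the paper's exact recipe; the sketch reuses the DualEncoderSegmenter class from the previous example.

```python
# Illustrative sketch of a three-stage training schedule as summarized above.
# Trainable splits, learning rates, and dataset names are assumptions.
import torch

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def run_stage(model, loaders, trainable, lr, steps):
    # Freeze everything, then unfreeze only the parts this stage trains.
    set_trainable(model, False)
    for name in trainable:
        set_trainable(getattr(model, name), True)
    optim = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):          # steps=0 keeps this illustration a no-op
        for batch in loaders:       # in practice: iterate real dataloaders
            ...
    return model

model = DualEncoderSegmenter()  # class defined in the previous sketch

# Stage 1: fine-tune the segmentation components on segmentation data.
run_stage(model, ["seg_datasets"], trainable=["seg_encoder", "query_proj"], lr=1e-4, steps=0)
# Stage 2: align visual features with the language space (train the projector).
run_stage(model, ["alignment_datasets"], trainable=["projector"], lr=1e-4, steps=0)
# Stage 3: mixed fine-tuning across diverse datasets for generalization.
run_stage(model, ["mixed_datasets"], trainable=["projector", "llm", "query_proj"], lr=2e-5, steps=0)
```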
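
For reference, the referring-segmentation scores above use cIoU (cumulative IoU): total intersection over total union accumulated across the whole evaluation set, rather than a per-image average. Panoptic quality (PQ) is likewise a standard metric: the sum of IoUs over matched segments divided by |TP| + ½|FP| + ½|FN|. A small NumPy sketch of cIoU:

```python
# cIoU (cumulative IoU) over a dataset of binary masks.
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """pred_masks, gt_masks: lists of boolean arrays with matching shapes."""
    inter, union = 0, 0
    for p, g in zip(pred_masks, gt_masks):
        inter += np.logical_and(p, g).sum()
        union += np.logical_or(p, g).sum()
    return inter / union if union > 0 else 0.0

# Toy example: two 4x4 masks.
p = [np.ones((4, 4), bool), np.zeros((4, 4), bool)]
g = [np.ones((4, 4), bool), np.ones((4, 4), bool)]
print(cumulative_iou(p, g))  # 16 / 32 = 0.5
```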
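
Finally, a hypothetical end-to-end sketch of the VGD setting described in Group 4: a box drawn on one image is encoded into a prompt token, and masks are then decoded on a different target image. It reuses the toy DualEncoderSegmenter above; the crop-and-pool prompt encoding is an assumption, and real VGD segmentation would return one mask per matching instance rather than the single mask per prompt produced here.

```python
# Hypothetical cross-image visual-prompting flow (not X-SAM's actual method).
import torch
import torch.nn.functional as F

def encode_visual_prompt(model, prompt_image, box):
    # Crop the boxed region, encode it with the global encoder, and pool it
    # into a single prompt token projected into the LLM space.
    x0, y0, x1, y1 = box
    crop = prompt_image[:, :, y0:y1, x0:x1]
    crop = F.interpolate(crop, size=(64, 64), mode="bilinear", align_corners=False)
    feats = model.global_encoder(crop)            # (B, llm_dim, h, w)
    token = feats.flatten(2).mean(-1)             # (B, llm_dim)
    return model.projector(token).unsqueeze(1)    # (B, 1, llm_dim)

model = DualEncoderSegmenter()                    # from the first sketch
prompt_img = torch.randn(1, 3, 64, 64)            # image containing one example object
target_img = torch.randn(1, 3, 64, 64)            # different image to segment
prompt_tokens = encode_visual_prompt(model, prompt_img, box=(8, 8, 40, 40))
masks = model(target_img, prompt_tokens)          # (1, 1, 16, 16): one mask per prompt
print(masks.shape)
```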