Promptable Concept Segmentation (PCS)

Mysterious ICLR paper surfaces: SAM 3 sees the world through "concepts," reshaping the visual-AI paradigm
36Kr · 2025-10-13 23:57
Core Insights
- The upcoming upgrade of the SAM model, SAM 3, focuses on "concept-based segmentation," allowing segmentation by semantic concept rather than just by pixels or instances [6][8][15].
- SAM 3 introduces a new standard called Promptable Concept Segmentation (PCS), enabling the model to identify and segment all objects matching a given concept across images and videos [8][12][16].
- The model was trained on a vast dataset, including approximately 4 million unique concept labels, enhancing its ability to understand and segment from user prompts [6][11][27].

Group 1: SAM 3 Features
- SAM 3 emphasizes interactive refinement of segmentation results, allowing users to provide additional prompts to resolve ambiguous cases [8][11].
- The model can track multiple instances of the same concept across video frames, improving its utility in dynamic environments [8][12].
- SAM 3 achieves significant performance gains, with a zero-shot segmentation accuracy of 47.0 on the LVIS dataset, surpassing the previous best of 38.5 [11][28].

Group 2: Data Engine and Training
- A human-AI collaborative data engine was developed to enhance the training process, allowing the model to learn from its mistakes and improve accuracy [19][22].
- The data engine consists of four phases, starting with human validation and progressing to AI-assisted validation and video annotation [21][25].
- The final dataset, SA-Co, includes 126,000 samples and 214,000 unique phrases, making it one of the largest open-vocabulary segmentation datasets available [28].

Group 3: Concept Segmentation Challenges
- PCS faces challenges from the vast range of possible concepts, which creates ambiguities the model must navigate [14].
- To address these ambiguities, SAM 3 employs multi-expert annotations and optimized evaluation protocols to ensure objectivity and accuracy [14][19].
- The model includes a dedicated "ambiguity module" to help it understand and tolerate vague boundaries in concept definitions [14][19].
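The PCS workflow summarized above — prompt with a concept, retrieve every matching instance, then refine interactively — can be illustrated with a self-contained toy sketch. Everything below (the `Mask` type, `segment_concept`, the sample scene) is hypothetical; SAM 3's real interface has not been published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mask:
    """Toy stand-in for a predicted segment: a concept label plus pixels."""
    label: str
    pixels: frozenset


def segment_concept(masks, concept, exclude=()):
    """Return every mask whose label matches the prompted concept.

    `exclude` models interactive refinement: instances the user has
    rejected via a follow-up prompt are dropped from the result.
    """
    concept = concept.lower()
    excluded = {id(m) for m in exclude}
    return [m for m in masks
            if m.label.lower() == concept and id(m) not in excluded]


# A tiny pre-segmented "image" with two cats and one dog.
scene = [
    Mask("cat", frozenset({(0, 0), (0, 1)})),
    Mask("dog", frozenset({(5, 5)})),
    Mask("cat", frozenset({(9, 9)})),
]

hits = segment_concept(scene, "cat")                        # both cat instances
refined = segment_concept(scene, "cat", exclude=[hits[0]])  # user rejects one
```

The point of the sketch is the task shape, not the model: a single concept prompt returns all instances (unlike SAM 1/2's one-prompt-one-mask interaction), and refinement narrows the set rather than restarting it.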
Meta's "Segment Anything" 3.0 revealed! Semantic segmentation gains concept prompts, it's fun to play with, and it looks set to take off
QbitAI (量子位) · 2025-10-13 03:35
Core Viewpoint
- The article discusses SAM 3, a third-generation segmentation model that enhances interactive segmentation by understanding natural-language prompts, allowing more intuitive and flexible image and video segmentation [3][6][10].

Group 1: Model Features
- SAM 3 introduces a new task paradigm called Promptable Concept Segmentation (PCS), enabling the model to segment instances in images or videos based on phrases or image examples [11][12].
- The model supports an open vocabulary, allowing users to input any noun phrase as a segmentation target, and can maintain identity consistency across video frames [17].
- SAM 3's architecture includes a Presence Head module that decouples object recognition from localization, improving performance in multi-instance segmentation [16][17].

Group 2: Data Engine and Benchmark
- A scalable data engine was built to support PCS, generating a training dataset with 4 million unique concept labels and 52 million verified masks [19].
- The SA-Co benchmark was introduced to evaluate open-vocabulary segmentation; it contains 214,000 unique concepts, roughly 50 times more than existing benchmarks cover [23][24].

Group 3: Performance Metrics
- SAM 3 achieved 47.0% accuracy in zero-shot segmentation on the LVIS dataset, surpassing the previous state of the art (SOTA) of 38.5% [28].
- On the new SA-Co benchmark, SAM 3 performed at least twice as well as baseline methods [29].
- The model outperformed its predecessor, SAM 2, on video segmentation tasks [30].

Group 4: Real-time Processing
- SAM 3 can process images containing over 100 entities in approximately 30 milliseconds on H200 GPUs, maintaining near-real-time performance for about five concurrent targets in video tasks [35].
Group 5: Limitations
- The model struggles to generalize zero-shot to specialized domains such as medical imaging and thermal imaging [36].
- In multi-target video segmentation scenarios, real-time performance may decline, necessitating multi-GPU parallel processing [37].
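The Presence Head described in Group 1 separates two judgments: whether a concept appears in the image at all (recognition) and where its instances are (localization). A minimal toy sketch of that gating idea follows; the function name, signature, and threshold are hypothetical illustrations of the design, not SAM 3's actual implementation.

```python
def concept_detections(presence_score, instance_boxes, threshold=0.5):
    """Presence-head idea, sketched: one global score answers
    "is this concept in the image at all?", and per-instance boxes
    are emitted only when that check passes.

    presence_score: model's confidence (0..1) that the concept is present
    instance_boxes: candidate (x1, y1, x2, y2) localizations
    """
    if presence_score < threshold:
        return []  # concept judged absent: suppress every candidate box
    return instance_boxes


# Concept judged present: localizations pass through.
kept = concept_detections(0.9, [(10, 20, 30, 40), (50, 60, 70, 80)])

# Concept judged absent: all candidate boxes are suppressed,
# even confident-looking ones.
dropped = concept_detections(0.1, [(10, 20, 30, 40)])
```

Scoring recognition and localization separately means a spurious but well-shaped box cannot survive when the model is unsure the concept is in the scene at all, which is the benefit the article attributes to this module for multi-instance segmentation.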