Semantic Segmentation

Meta's "Segment Anything" 3.0 surfaces: semantic segmentation gains concept prompts, great fun and poised to blow up
36Kr · 2025-10-13 03:52
Core Insights
- The article discusses the introduction of SAM 3, a third-generation segmentation model that can understand natural-language prompts for image and video segmentation tasks [1][3][5].

Group 1: Model Capabilities
- SAM 3 can segment images and videos based on user-defined phrases, allowing more interactive and intuitive segmentation workflows [3][6].
- The model processes images containing over 100 objects in about 30 milliseconds, demonstrating near-real-time capability for video processing [5][21].
- SAM 3 introduces a new task paradigm called Promptable Concept Segmentation (PCS), which allows multi-instance segmentation driven by various input prompts [6][7]; a hedged usage sketch follows this summary.

Group 2: Technical Innovations
- The architecture of SAM 3 includes a new detection module based on DETR (Detection Transformer), which separates object recognition from localization to improve detection accuracy [11].
- A scalable data engine was developed to build a training dataset with 4 million unique concept labels and 52 million validated masks, improving the model's performance [12].
- The SA-Co benchmark was introduced to evaluate open-vocabulary segmentation, significantly expanding concept coverage compared with existing benchmarks [13].

Group 3: Performance Metrics
- SAM 3 achieved 47.0% accuracy on zero-shot segmentation on the LVIS dataset, surpassing the previous state of the art (SOTA) of 38.5% [16].
- On the new SA-Co benchmark, SAM 3 performs at least twice as well as baseline methods [16].
- The model also outperformed SAM 2 on video segmentation tasks, indicating substantial improvements [18].

Group 4: Future Directions
- Researchers are exploring the combination of SAM 3 with multimodal large language models (MLLMs) to tackle more complex segmentation tasks, such as identifying specific scenarios in images [19].
- Despite these advances, SAM 3 still struggles to generalize zero-shot to specialized domains such as medical imaging and thermal imaging [21].
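For readers new to the promptable-concept workflow, the minimal sketch below shows what phrase-driven, multi-instance segmentation looks like in code. The `ConceptSegmenter` class, its `segment(image, phrase)` method, and the placeholder logic are hypothetical stand-ins rather than Meta's published SAM 3 API; only the shape of the task (one noun phrase in, one mask per matching instance out) comes from the article.

```python
# Hedged illustration of the Promptable Concept Segmentation (PCS) workflow:
# one noun phrase in, one mask per matching instance out.
# "ConceptSegmenter" and "segment" are hypothetical names, NOT Meta's SAM 3 API.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class InstanceMask:
    mask: np.ndarray     # H x W boolean mask for one matched instance
    score: float         # confidence that this instance matches the phrase
    instance_id: int     # stable ID so the same object can be tracked across video frames


class ConceptSegmenter:
    """Toy stand-in: a real PCS model returns one mask per instance matching the phrase."""

    def segment(self, image: np.ndarray, phrase: str) -> List[InstanceMask]:
        # Placeholder logic: pretend the brightest quarter of the pixels form the single
        # matching instance.  A real model would ground the noun phrase ("yellow school
        # bus", "striped umbrella", ...) and return every matching instance separately.
        threshold = np.percentile(image, 75)
        return [InstanceMask(mask=image >= threshold, score=0.5, instance_id=0)]


if __name__ == "__main__":
    frame = np.random.rand(240, 320)  # stand-in for a decoded grayscale video frame
    for inst in ConceptSegmenter().segment(frame, phrase="striped umbrella"):
        print(inst.instance_id, round(inst.score, 2), inst.mask.shape)
```

The key contrast with point- or box-style prompting in earlier SAM versions is that the prompt names a concept rather than pointing at a single object, so the natural return type is a list of instances rather than one mask.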
Meta's "Segment Anything" 3.0 surfaces! Semantic segmentation gains concept prompts, great fun and poised to blow up
QbitAI (量子位) · 2025-10-13 03:35
Core Viewpoint
- The article discusses the introduction of SAM 3, a third-generation segmentation model that enhances interactive segmentation by understanding natural-language prompts, allowing more intuitive and flexible image and video segmentation [3][6][10].

Group 1: Model Features
- SAM 3 introduces a new task paradigm called Promptable Concept Segmentation (PCS), enabling the model to segment instances in images or videos based on phrases or image exemplars [11][12].
- The model supports an open vocabulary, allowing users to supply any noun phrase as a segmentation target, and can maintain identity consistency across video frames [17].
- SAM 3's architecture includes a Presence Head module that decouples object recognition from localization, improving performance in multi-instance segmentation [16][17]; a schematic sketch of this decoupling follows this summary.

Group 2: Data Engine and Benchmark
- A scalable data engine was built to support PCS, generating a training dataset with 4 million unique concept labels and 52 million verified masks [19].
- The SA-Co benchmark was introduced to evaluate open-vocabulary segmentation; it contains 214,000 unique concepts, roughly 50 times the concept coverage of existing benchmarks [23][24].

Group 3: Performance Metrics
- SAM 3 achieved 47.0% accuracy on zero-shot segmentation on the LVIS dataset, surpassing the previous state of the art (SOTA) of 38.5% [28].
- On the new SA-Co benchmark, SAM 3's performance was at least twice as strong as baseline methods [29].
- The model demonstrated superior video-segmentation performance compared with its predecessor, SAM 2 [30].

Group 4: Real-time Processing
- SAM 3 can process an image with over 100 entities in approximately 30 milliseconds on an H200 GPU, maintaining near-real-time performance for about five concurrent targets in video tasks [35]; a quick frame-rate calculation also follows this summary.

Group 5: Limitations
- The model struggles to generalize zero-shot to specialized domains such as medical imaging and thermal imaging [36].
- In multi-target video segmentation, real-time performance may degrade, necessitating multi-GPU parallel processing [37].
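The "decoupling" in the Presence Head bullet can be pictured as two separate scoring paths on top of DETR-style query embeddings: a global head that decides whether the prompted concept is present in the image at all, and per-query heads that localize individual instances. The PyTorch sketch below is a schematic interpretation under that assumption; the class name, dimensions, and wiring are illustrative, not Meta's published design.

```python
# Schematic sketch of separating "is the concept present?" from "where is each instance?".
# Mirrors the idea attributed to SAM 3's Presence Head in the article; names, dimensions,
# and the way scores are combined here are assumptions, not the published architecture.
import torch
from torch import nn


class DecoupledDetectionHead(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Global head: scores whether the prompted concept appears anywhere in the image.
        self.presence_head = nn.Linear(embed_dim, 1)
        # Per-query heads: localize and score individual candidate instances.
        self.box_head = nn.Linear(embed_dim, 4)
        self.instance_score_head = nn.Linear(embed_dim, 1)

    def forward(self, queries: torch.Tensor, global_token: torch.Tensor):
        # queries: (batch, num_queries, embed_dim) -- DETR-style object queries
        # global_token: (batch, embed_dim) -- pooled image/prompt representation
        presence = torch.sigmoid(self.presence_head(global_token))          # (batch, 1)
        boxes = self.box_head(queries).sigmoid()                            # (batch, Q, 4)
        instance_scores = torch.sigmoid(self.instance_score_head(queries))  # (batch, Q, 1)
        # Combine: an instance is confident only if the concept is present AND this
        # particular query localizes it well.
        scores = presence.unsqueeze(1) * instance_scores
        return boxes, scores


if __name__ == "__main__":
    head = DecoupledDetectionHead()
    q = torch.randn(2, 100, 256)
    g = torch.randn(2, 256)
    boxes, scores = head(q, g)
    print(boxes.shape, scores.shape)  # torch.Size([2, 100, 4]) torch.Size([2, 100, 1])
```

Multiplying the global presence score into each query's instance score is one simple way to realize "recognize first, then localize"; the article only states that the two tasks are separated, not how the scores are actually combined.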
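As a quick sanity check on the Group 4 latency figure, the short calculation below converts the reported 30 ms per image into an approximate frame rate; the latency value comes from the article, and the conversion is straightforward arithmetic.

```python
# Convert the reported per-image latency into an approximate frame rate.
latency_s = 0.030                    # ~30 ms per image with 100+ objects on an H200 (from the article)
images_per_second = 1.0 / latency_s  # ~33 images/s, comfortably above 30 fps video
print(f"~{images_per_second:.0f} images/s at the reported latency")
# The article adds that tracking many concurrent targets in video pushes the model below
# real time, which is why multi-GPU parallel processing is suggested for those cases.
```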