Core Insights
- Meta has made significant advances in AI, particularly in visual models: the successive releases of SAM1, SAM2, and SAM3 mark a new era in computer vision technology [1][25].

Summary by Sections

SAM1 to SAM3 Evolution
- SAM1 introduced Promptable Visual Segmentation (PVS), enabling image segmentation from simple prompts such as clicks or semantic hints [1].
- SAM2 optimized the architecture for video segmentation and dynamic scenes, improving stability and accuracy [3].
- SAM3 achieves unprecedented accuracy with multi-modal support, accepting voice, text, and image prompts, and introduces Promptable Concept Segmentation (PCS) for complex object recognition [3][4].

Technical Specifications
- SAM1's smaller model size suited real-time inference; SAM2 improved efficiency, and SAM3 adds the computational capacity needed for complex tasks [4].
- SAM3 supports real-time video and image segmentation with multi-object tracking, showcasing its advanced capabilities [4].
- The model performs long-context semantic reasoning, improving video scene analysis [4].

Concept Segmentation
- SAM3 can identify and segment objects matching user-defined concepts, such as "striped cat", demonstrating its flexibility and precision [7][11].
- The model refines segmentation results using positive and negative examples, improving accuracy [10].

Performance Metrics
- SAM3 outperforms previous models on a range of segmentation tasks, scoring highly on datasets such as LVIS and COCO [21][23].
- Its zero-shot performance is notable: the model handles tasks effectively without task-specific training data [29].

Multi-modal Capabilities
- SAM3's integration with MLLMs (Multimodal Large Language Models) enables complex text queries, enhancing its object segmentation tasks [21][29].
- The model's ability to combine text and image inputs significantly improves segmentation outcomes, showcasing its strength in multi-modal tasks [23].

Conclusion
- The advancements from SAM1 to SAM3 reflect Meta's strategic push into the visual AI domain, reshaping applications in everyday life, including autonomous driving and video surveillance [25][26].
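To make the prompting idea above concrete, the sketch below mimics click-based refinement: positive point prompts pull a region into the mask and negative points push a region out, which is the spirit of SAM's interactive prompting and of PCS refinement via positive/negative examples. This is a self-contained toy (the `segment_from_points` helper and its score-map input are hypothetical illustrations, not the actual SAM API).

```python
import numpy as np

def segment_from_points(score_map, points, labels, radius=2, threshold=0.5):
    """Toy promptable segmentation: refine a per-pixel score map using
    positive (label=1) and negative (label=0) point prompts, then
    threshold the result into a binary mask. Hypothetical helper for
    illustration only."""
    refined = score_map.astype(float).copy()
    h, w = refined.shape
    yy, xx = np.mgrid[0:h, 0:w]
    for (y, x), lab in zip(points, labels):
        dist = np.hypot(yy - y, xx - x)
        # Local influence of the click, fading to zero with distance.
        bump = np.clip(1.0 - dist / (radius + 1), 0.0, 1.0)
        if lab == 1:
            refined = np.maximum(refined, bump)        # positive click: include region
        else:
            refined = np.minimum(refined, 1.0 - bump)  # negative click: exclude region
    return refined >= threshold

# Example: an ambiguous score map refined by one positive and one negative click.
scores = np.full((8, 8), 0.4)
mask = segment_from_points(scores, points=[(2, 2), (6, 6)], labels=[1, 0])
# Pixel (2, 2) is pulled into the mask; pixel (6, 6) is pushed out.
```

The real models replace the hand-built score map with image (and, for SAM3, text/concept) encoders, but the interaction loop is the same: each extra prompt monotonically tightens the mask around the user's intent.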
From SAM1 to SAM3: What Has Meta Done?