SAM 3
SAM 3 Surfaces at ICLR 2026, the Next Step for "Segment Anything": Letting the Model Understand "Concepts"
具身智能之心 · 2025-10-14 00:02
Editor: 机器之心

Say a concept, and SAM 3 understands what you mean and precisely traces out its boundaries everywhere it appears. Is Meta's "Segment Anything" getting a new installment? On September 12, an anonymous paper, "SAM 3: SEGMENT ANYTHING WITH CONCEPTS," landed at ICLR 2026 and drew wide attention. Many speculated that the paper comes from Meta, since its writing style closely resembles Meta's earlier publications; and with SAM and SAM 2 both released by Meta, outside observers are all but certain that SAM 3 is the official sequel to the "Segment Anything" series. The model takes text and/or image exemplars as input and predicts an instance mask and a semantic mask for every object matching the given concept, while keeping object identities consistent across video frames. The work focuses on recognizing atomic visual concepts, so input text is restricted to simple noun phrases such as "red apple" or "striped cat"; just describe what you want ...
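From the task description above, here is a minimal sketch of what a concept-segmentation interface could look like (the later summaries name the task Promptable Concept Segmentation, PCS). All names here are illustrative assumptions; the paper describes a task, and no public SAM 3 API exists at the time of writing.

```python
# Hypothetical PCS-style interface, inferred from the task description above.
# None of these names come from the paper or from a released SAM 3 API.
from dataclasses import dataclass, field

@dataclass
class ConceptPrompt:
    """A concept prompt: a simple noun phrase, image exemplars, or both."""
    noun_phrase: str | None = None   # e.g. "red apple", "striped cat"
    exemplar_boxes: list[tuple[float, float, float, float]] = field(default_factory=list)

@dataclass
class ConceptResult:
    instance_masks: list        # one binary mask per matched object
    instance_ids: list[int]     # identities kept stable across video frames
    semantic_mask: object       # union of all instances of the concept

def segment_frames(frames: list, prompt: ConceptPrompt) -> list[ConceptResult]:
    """PCS as described: find *every* object matching the concept in each
    frame, returning instance and semantic masks with consistent IDs."""
    raise NotImplementedError("placeholder; SAM 3 weights/API are not public")
```

The contrast with SAM 1/2-style prompts (points and boxes addressing one object per interaction) is that a single concept prompt addresses all matching instances at once.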
Mysterious ICLR Paper Surfaces: SAM 3 Sees the World Through "Concepts," Reshaping the Visual AI Paradigm
36Kr · 2025-10-13 23:57
Core Insights
- The upcoming upgrade to the SAM line, SAM 3, centers on "concept-based segmentation": segmenting by semantic concept rather than by individual pixels or instances [6][8][15]
- SAM 3 introduces a new task standard called Promptable Concept Segmentation (PCS), which asks the model to identify and segment every object matching a given concept across images and videos [8][12][16]
- The model is trained on a vast dataset including approximately 4 million unique concept labels, strengthening its ability to understand and segment from user prompts [6][11][27]

Group 1: SAM 3 Features
- SAM 3 emphasizes interactive refinement of segmentation results: users can supply additional prompts to resolve ambiguous cases [8][11]
- The model can track multiple instances of the same concept across video frames, improving its utility in dynamic environments [8][12]
- SAM 3 delivers significant performance gains, reaching a zero-shot mask AP of 47.0 on the LVIS dataset versus the previous best of 38.5 [11][28]

Group 2: Data Engine and Training
- A human-AI collaborative data engine improves training by letting the model learn from its mistakes [19][22]
- The data engine runs in four phases, starting from fully human validation and progressing to AI-assisted validation and video annotation [21][25]
- The final dataset, SA-Co, includes 126,000 samples and 214,000 unique phrases, making it one of the largest open-vocabulary segmentation datasets available [28]

Group 3: Concept Segmentation Challenges
- PCS is hard because the space of possible concepts is vast, producing ambiguities the model must navigate [14]
- To handle these ambiguities, SAM 3 uses multi-expert annotations and refined evaluation protocols to keep assessment objective and accurate [14][19]
- The model includes a dedicated "ambiguity module" to help it represent and tolerate vague boundaries in concept definitions [14][19]
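The four-phase data engine summarized above can be pictured as a loop in which the current model proposes annotations, verifiers filter them, and the model retrains on what survives. A minimal sketch under assumptions: the stub verifiers, thresholds, and the `propose`/`finetune` callables are illustrative, not the paper's pipeline.

```python
import random

def human_verify(proposals):
    """Stub standing in for human annotators accepting or fixing proposals."""
    return [p for p in proposals if p["score"] > 0.5]

def ai_verifier_split(proposals, threshold=0.9):
    """Stub: an AI verifier routes confident proposals past human review."""
    easy = [p for p in proposals if p["score"] > threshold]
    hard = [p for p in proposals if p["score"] <= threshold]
    return easy, hard

def data_engine(propose, finetune, pool, phases=4):
    dataset = []
    for phase in range(1, phases + 1):
        samples = random.sample(pool, k=min(8, len(pool)))
        proposals = [propose(x) for x in samples]
        if phase == 1:
            verified = human_verify(proposals)          # phase 1: all-human validation
        else:
            easy, hard = ai_verifier_split(proposals)   # later phases: AI-assisted
            verified = easy + human_verify(hard)        # humans focus on hard cases
        dataset.extend(verified)
        finetune(dataset)                               # model learns from its mistakes
    return dataset
```

The point of the AI-assisted phases is throughput: as the verifier improves, human effort concentrates on exactly the failure cases that teach the model the most.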
Tencent Research Institute AI Express 20251014
腾讯研究院 · 2025-10-13 17:53
Generative AI

1. Just in: OpenAI officially announces in-house AI chips; Broadcom stock surges 10%
   1. OpenAI and Broadcom have reached a strategic partnership to deploy 10 gigawatts of OpenAI-designed custom AI chips; Broadcom plans to begin deployment in the second half of 2026 and complete it by the end of 2029;
   2. This is OpenAI's third blockbuster deal with a chip giant within a month, following the announced $100 billion NVIDIA investment and a 6-gigawatt AMD GPU deployment agreement;
   3. Sam Altman revealed that the two sides have spent the past 18 months designing the new chip, with OpenAI's own models taking part in the design; Broadcom's stock briefly rose more than 10% on the news.
   https://mp.weixin.qq.com/s/1VqWsC2R2dpIwYVxyF3Jlg

2. Google teases a Gemini 3.0 "full family" update; front-end work may no longer need humans
   1. Gemini 3.0 is expected to launch on October 22; beta testers have posted striking demos showing ...
   3. Adds an asset-management menu and an AI toolbox entry, bundling model workflows such as HD upscaling, background removal, and product retouching into a one-stop AI experience for new and existing users.
   https://mp.weixin.qq.com/s/CzMtMYyCEdqoRU2lTsCJ0g

4. Mamba's latest evolution, Mamba-3, arrives at ICLR 2026
SAM 3 Surfaces at ICLR 2026, the Next Step for "Segment Anything": Letting the Model Understand "Concepts"
机器之心 · 2025-10-13 04:21
Core Insights
- The article discusses a new paper, "SAM 3: Segment Anything with Concepts," widely believed to be the continuation of Meta's "Segment Anything" series following SAM 1 and SAM 2 [1][3][4]

Group 1: Overview of SAM 3
- SAM 3 introduces a new task, Promptable Concept Segmentation (PCS): given text or image exemplars as input, predict instance and semantic masks for every matching object while maintaining identity consistency across video frames [8][12]
- The model focuses on identifying atomic visual concepts, handling simple noun phrases such as "red apple" or "striped cat" as segmentation targets [8][12]
- SAM 3 improves on its predecessors at promptable visual segmentation and establishes a new standard for PCS [18]

Group 2: Performance Metrics
- SAM 3 shows significant gains, at least a 2x improvement over prior systems on the newly proposed SA-Co benchmark [13]
- On the LVIS dataset, SAM 3 reaches a zero-shot mask average precision of 47.0, surpassing the previous best of 38.5 [13]
- The model processes an image with over 100 objects in about 30 milliseconds on a single H200 GPU [14]

Group 3: Methodology and Data
- SAM 3 uses a dual encoder-decoder transformer architecture, pairing a detector with a tracker and a memory module for video applications [20]
- The team built a scalable human-machine collaborative data engine, annotating a high-quality training set with 4 million unique phrases and 52 million masks [21]
- The PCS benchmark includes 124K images and 1.7K videos with 214K unique concepts, a large expansion in concept count over existing benchmarks [25]

Group 4: Comparative Analysis
- SAM 3 outperforms previous models at instance segmentation, box detection, and semantic segmentation across multiple datasets [27][28]
- In open-vocabulary semantic segmentation experiments, SAM 3 exceeded strong baseline models [29]
- The model also demonstrated better object-counting accuracy and segmentation than competing models [33]
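A rough sketch of how the detector/tracker/memory split described above could be wired together for video: the detector finds concept matches per frame, the tracker associates them with remembered identities, and the memory is refreshed from the newest detections. The function signatures are assumptions for illustration, not the paper's code.

```python
def segment_video(frames, concept, detector, tracker):
    """Assumed contracts: detector(frame, concept) returns a list of dicts
    with "mask" and "embedding" keys; tracker(detections, memory) returns
    (detection, track_id) pairs, minting new IDs for unmatched detections."""
    memory = {}                                     # track_id -> last-seen embedding
    results = []
    for frame in frames:
        detections = detector(frame, concept)       # per-frame concept matches
        assignments = tracker(detections, memory)   # associate with known identities
        for det, track_id in assignments:
            memory[track_id] = det["embedding"]     # refresh memory for that identity
        results.append([(track_id, det["mask"]) for det, track_id in assignments])
    return results
```

Keeping a per-identity memory is what lets the same object carry the same ID even as it moves or is briefly occluded.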
Meta's "Segment Anything" 3.0 Revealed: Concept Prompts Come to Semantic Segmentation, Great Fun, About to Blow Up
36Kr · 2025-10-13 03:52
Core Insights
- The article introduces SAM 3, a third-generation segmentation model that understands natural-language prompts for image and video segmentation tasks [1][3][5]

Group 1: Model Capabilities
- SAM 3 can segment images and videos from user-defined phrases, enabling more interactive and intuitive segmentation tasks [3][6]
- The model processes images containing over 100 objects in about 30 milliseconds, approaching real-time video processing [5][21]
- SAM 3 introduces a new task paradigm, Promptable Concept Segmentation (PCS), supporting multi-instance segmentation from various input prompts [6][7]

Group 2: Technical Innovations
- SAM 3's architecture adds a new detection module built on DETR (the DEtection TRansformer), separating object recognition from localization to improve detection accuracy [11]
- A scalable data engine produced a training dataset with 4 million unique concept labels and 52 million validated masks, lifting the model's performance [12]
- The SA-Co benchmark was introduced to evaluate open-vocabulary segmentation, significantly expanding concept coverage over existing benchmarks [13]

Group 3: Performance Metrics
- SAM 3 reaches a zero-shot mask AP of 47.0 on the LVIS dataset, surpassing the previous state of the art (SOTA) of 38.5 [16]
- On the new SA-Co benchmark, SAM 3 is at least twice as strong as baseline methods [16]
- The model also outperforms SAM 2 on video segmentation tasks, a significant improvement [18]

Group 4: Future Directions
- Researchers are exploring combining SAM 3 with multimodal large language models (MLLMs) to tackle more complex segmentation tasks, such as identifying specific scenarios in images [19]
- Despite its advances, SAM 3 still struggles to generalize zero-shot to specialized fields such as medical imaging and thermal imaging [21]
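The recognition/localization decoupling described above (the 量子位 summary below calls the component a Presence Head) can be sketched as a DETR-style head in which one global logit scores whether the concept appears at all, separately from per-query match scores and boxes. A minimal PyTorch sketch under assumed shapes and names, not the actual SAM 3 module:

```python
import torch
import torch.nn as nn

class PresenceHead(nn.Module):
    """Decouples "is the concept present?" (global) from "which query is it?"."""
    def __init__(self, dim: int):
        super().__init__()
        self.presence = nn.Linear(dim, 1)   # global: is the concept in the image?
        self.cls_head = nn.Linear(dim, 1)   # per-query: does this query match it?
        self.box_head = nn.Linear(dim, 4)   # per-query localization (cx, cy, w, h)

    def forward(self, queries: torch.Tensor, global_token: torch.Tensor):
        # queries: (B, N, dim) object queries; global_token: (B, dim)
        presence = torch.sigmoid(self.presence(global_token))      # (B, 1)
        match = torch.sigmoid(self.cls_head(queries)).squeeze(-1)  # (B, N)
        boxes = torch.sigmoid(self.box_head(queries))              # (B, N, 4), in [0, 1]
        scores = presence * match   # low global presence suppresses every query
        return scores, boxes

# usage sketch:
# head = PresenceHead(256)
# scores, boxes = head(torch.randn(2, 100, 256), torch.randn(2, 256))
```

The benefit of the split is that a confident-looking box in an image that does not contain the concept at all gets suppressed by the low global presence score, instead of each query having to learn both jobs at once.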
Meta's "Segment Anything" 3.0 Revealed! Concept Prompts Come to Semantic Segmentation, Great Fun, About to Blow Up
量子位 · 2025-10-13 03:35
Core Viewpoint
- The article covers SAM 3, a third-generation segmentation model that strengthens interactive segmentation by understanding natural-language prompts, enabling more intuitive and flexible image and video segmentation tasks [3][6][10]

Group 1: Model Features
- SAM 3 introduces a new task paradigm, Promptable Concept Segmentation (PCS), letting the model segment instances in images or videos from phrases or image exemplars [11][12]
- The model supports an open vocabulary, accepting any noun phrase as a segmentation target, and keeps identities consistent across video frames [17]
- SAM 3's architecture includes a Presence Head module that decouples object recognition from localization, improving multi-instance segmentation [16][17]

Group 2: Data Engine and Benchmark
- A scalable data engine built for PCS generated a training dataset with 4 million unique concept labels and 52 million verified masks [19]
- The SA-Co benchmark was introduced to evaluate open-vocabulary segmentation; it contains 214,000 unique concepts, roughly 50 times the coverage of existing benchmarks [23][24]

Group 3: Performance Metrics
- SAM 3 reaches a zero-shot mask AP of 47.0 on the LVIS dataset, surpassing the previous SOTA of 38.5 [28]
- On the new SA-Co benchmark, SAM 3 is at least twice as strong as baseline methods [29]
- The model demonstrated superior performance over its predecessor, SAM 2, on video segmentation tasks [30]

Group 4: Real-time Processing
- SAM 3 processes images with over 100 entities in roughly 30 milliseconds on H200 GPUs and stays near real time for about five concurrent targets in video tasks [35]

Group 5: Limitations
- The model struggles to generalize zero-shot to specialized fields such as medical imaging and thermal imaging [36]
- In multi-target video segmentation scenarios, real-time performance can degrade, necessitating multi-GPU parallel processing [37]
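A quick back-of-the-envelope check on the latency figures quoted above; this only restates the articles' numbers, it is not an independent benchmark.

```python
# 30 ms per 100-entity image on one H200, as quoted above.
latency_ms = 30.0
fps = 1000.0 / latency_ms
print(f"single-stream throughput: {fps:.1f} images/s")   # ~33.3, near real time
print(f"object throughput: {100 * fps:.0f} masks/s")     # ~3333 masks/s

# A 30 fps video stream allows ~33 ms per frame, so the per-frame budget is
# already tight; the reported ceiling of about five concurrent targets before
# multi-GPU parallelism is needed is consistent with that budget.
frame_budget_ms = 1000.0 / 30
print(f"per-frame budget at 30 fps: {frame_budget_ms:.1f} ms")
```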