SAM 3
Meta's SAM 3: AI Vision just got a HUGE UPGRADE (FREE)
Matthew Berman· 2025-12-10 19:43
Meta just dropped SAM 3, the Segment Anything Model, and it allows you to use simple text prompting to segment anything in a video easily. Let me take a step back. There's this thing called rotoscoping. It is the extremely manual process in which a team of dozens of people segments different elements in a video by hand. And now, with SAM 3, it takes seconds. I'm partnering with Meta on this video to tell you about this incredible open-source, open-weights model that allows you to do some pretty incre ...
Segmenting Everything Isn't Enough; Now Reconstruct Everything in 3D: SAM 3D Is Here
具身智能之心· 2025-11-21 00:04
Core Viewpoint
- Meta has launched significant updates with the introduction of SAM 3D and SAM 3, enhancing the understanding of images in 3D and providing advanced capabilities for object detection, segmentation, and tracking in images and videos [2][6][40].

Group 1: SAM 3D Overview
- SAM 3D is the latest addition to the SAM series, featuring two models: SAM 3D Objects and SAM 3D Body, both demonstrating state-of-the-art performance in converting 2D images into detailed 3D reconstructions [2][4].
- SAM 3D Objects allows users to generate 3D models from a single image, overcoming limitations of traditional 3D modeling that often relies on isolated or synthetic data [11][15].
- Meta has annotated nearly 1 million real-world images, generating approximately 3.14 million 3D meshes, utilizing a scalable data engine to enhance the quality and quantity of 3D data [20][26].

Group 2: SAM 3D Body
- SAM 3D Body focuses on accurate 3D human pose and shape reconstruction from single images, maintaining high-quality performance even in complex scenarios with occlusions and unusual poses [28][30].
- The model is interactive, allowing users to guide and control predictions, enhancing accuracy and usability [29].
- A high-quality training dataset of around 8 million images was created to improve the model's performance across various 3D benchmarks [33].

Group 3: SAM 3 Capabilities
- SAM 3 introduces promptable concept segmentation, enabling the model to detect and segment specific concepts based on text or example image prompts, significantly improving its performance in concept recognition (a hypothetical usage sketch follows this summary) [40][42].
- The architecture of SAM 3 builds on previous advancements, utilizing components like the Meta Perception Encoder and DETR for enhanced image recognition and object detection capabilities [42][44].
- SAM 3 achieves a twofold increase in cgF1 scores for concept recognition and maintains near real-time performance for images with over 100 detection targets, completing inference in approximately 30 milliseconds on H200 GPUs [44].
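Since promptable concept segmentation recurs throughout the coverage below, a minimal interface sketch may help make the idea concrete: a short noun phrase (or an example crop) goes in, and one mask per matching instance comes out, with IDs that stay stable across video frames. Every name here is an illustrative placeholder, not the API Meta ships with SAM 3.

```python
# Hypothetical PCS interface sketch; names are placeholders, not SAM 3's real API.
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class ConceptPrompt:
    phrase: Optional[str] = None                                # e.g. "red apple", "striped cat"
    exemplars: List[np.ndarray] = field(default_factory=list)   # optional example image crops

@dataclass
class InstanceMask:
    mask: np.ndarray        # H x W boolean mask for one matched instance
    score: float            # confidence that this instance matches the concept
    instance_id: int        # identity that stays consistent across video frames

def segment_concept(image: np.ndarray, prompt: ConceptPrompt) -> List[InstanceMask]:
    """Stand-in for a PCS model call: one mask per instance matching the prompt."""
    raise NotImplementedError("placeholder for a real SAM 3-style model call")

# Intended usage:
#   masks = segment_concept(image, ConceptPrompt(phrase="yellow school bus"))
#   print(f"{len(masks)} instances found")
```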
AI Vision's GPT Moment: Meta's New Model "Segments the World" in One Click, and Netizens Call It Insane
36Kr· 2025-11-20 10:04
Core Insights
- Meta has launched a new family of models called SAM 3D, which includes SAM 3D Objects for object and scene reconstruction and SAM 3D Body for human shape estimation [1][12]
- The SAM 3D series allows users to extract 3D models from 2D images with high accuracy, enabling 360-degree rotation without noticeable flaws [1][11]
- SAM 3 introduces a new feature called "promptable concept segmentation," enhancing the model's versatility in image segmentation tasks [1][19]

SAM 3D Objects
- SAM 3D Objects has achieved significant advancements in 3D object reconstruction, utilizing a data annotation engine that has labeled nearly one million images to generate over 3.14 million mesh models [7][9]
- The model outperforms existing leading models in human preference tests with a 5:1 advantage, enabling near-real-time 3D applications [10][11]
- SAM 3D Objects can reconstruct shapes, textures, and poses of objects, allowing users to manipulate the camera for different viewing angles [11][12]

SAM 3D Body
- SAM 3D Body focuses on human 3D reconstruction, accurately estimating human poses and shapes from single images, even in complex scenarios [12][13]
- The model supports prompt inputs, allowing users to guide predictions through segmentation masks and key points, enhancing interactivity [12][13]
- SAM 3D Body has been trained on approximately 8 million high-quality samples, ensuring robustness across diverse scenarios [13][16]

SAM 3 Model Features
- SAM 3 is a unified model capable of detecting, segmenting, and tracking objects based on text, example images, or visual prompts, significantly improving flexibility in segmentation tasks [18][19]
- The model has shown a 100% improvement in concept segmentation performance on the SA-Co benchmark compared to previous models [19][20]
- Meta has implemented a collaborative data engine involving both AI and human annotators to enhance data labeling efficiency and model performance (a schematic sketch of such a loop follows this summary) [20][23]

Conclusion
- The rise of generative AI is transforming computer vision (CV) capabilities, expanding the boundaries of model training and data set creation [24]
- Meta is actively applying these technologies in real business scenarios, suggesting that the SAM and SAM 3D series models may yield further innovations as data and user feedback accumulate [24]
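The collaborative data engine mentioned under "SAM 3 Model Features" is, in general form, a triage loop: the model proposes masks, an AI verifier scores them, and humans adjudicate only the ambiguous remainder. A schematic sketch under those assumptions; the function names and thresholds are illustrative, not Meta's pipeline.

```python
# Schematic human-in-the-loop data engine: the model proposes, an AI verifier
# filters, and humans only check the uncertain slice. Names are illustrative.
def data_engine_round(images, propose, ai_verify, human_verify,
                      auto_accept=0.9, auto_reject=0.2):
    accepted, queued_for_humans = [], []
    for img in images:
        for candidate in propose(img):            # model-proposed masks/labels
            conf = ai_verify(img, candidate)      # AI annotator's confidence
            if conf >= auto_accept:
                accepted.append((img, candidate))           # auto-accepted
            elif conf > auto_reject:
                queued_for_humans.append((img, candidate))  # needs a person
    # Humans adjudicate only the ambiguous slice, which is what lets the
    # engine scale to millions of verified masks.
    accepted += [(img, c) for img, c in queued_for_humans if human_verify(img, c)]
    return accepted
```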
Meta's "Segment Anything" Enters the 3D Era! Image Segmentation Results Go Straight to 3D, and Even Occluded Objects Can Be Recovered
量子位· 2025-11-20 07:01
Core Viewpoint
- Meta's new 3D modeling paradigm allows for direct conversion of image segmentation results into 3D models, enhancing the capabilities of 3D reconstruction from 2D images [1][4][8].

Summary by Sections

3D Reconstruction Models
- Meta's MSL lab has released SAM 3D, which includes two models: SAM 3D Objects for object and scene reconstruction, and SAM 3D Body focused on human modeling [4][8].
- SAM 3D Objects can reconstruct 3D models and estimate object poses from a single natural image, overcoming challenges like occlusion and small objects [10][11].
- SAM 3D Objects outperforms existing methods, achieving a win rate at least five times higher than leading models in direct user comparisons [13][14].

Performance Metrics
- SAM 3D Objects shows significant performance improvements in 3D shape and scene reconstruction, with metrics such as an F1 score of 0.2339 and a 3D IoU of 0.4254 [15].
- SAM 3D Body also achieves state-of-the-art (SOTA) results in human modeling, with an MPJPE of 61.7 and a PCK of 75.4 across various datasets [18].

Semantic Understanding
- SAM 3 introduces a concept segmentation feature that allows for flexible object segmentation based on user-defined prompts, overcoming limitations of fixed label sets [21][23].
- The model can identify and segment objects based on textual descriptions or selected examples, significantly enhancing its usability [26][31].

Benchmarking and Results
- SAM 3 has set a new SOTA in promptable segmentation tasks, achieving a zero-shot mask AP of 47.0 on the LVIS dataset, surpassing the previous SOTA of 38.5 [37].
- In the new SA-Co benchmark, SAM 3's performance is at least twice as strong as baseline methods [38].

Technical Architecture
- SAM 3's architecture is built on a shared Perception Encoder, which improves consistency and efficiency in feature extraction for both detection and tracking tasks [41][43].
- The model employs a two-stage generative approach for SAM 3D Objects, utilizing a 1.2 billion parameter flow-matching transformer for geometric predictions (a generic flow-matching training sketch follows this summary) [49][50].
- SAM 3D Body utilizes a unique Momentum Human Rig representation to decouple skeletal pose from body shape, enhancing detail in human modeling [55][60].
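The last group above notes that SAM 3D Objects predicts geometry with a 1.2-billion-parameter flow-matching transformer. The snippet below shows only the generic flow-matching training step (straight-line path from noise to data, MSE on the predicted velocity), not Meta's model; `model` is assumed to be any network mapping (noisy latent, time, image conditioning) to a velocity of the same shape.

```python
# Generic (linear-path) flow-matching training step.
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, cond):
    """x1: clean geometry latents (B, D); cond: image conditioning (B, C)."""
    x0 = torch.randn_like(x1)                          # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, device=x1.device)   # time sampled in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                       # point on the straight path
    target_v = x1 - x0                                 # constant target velocity
    pred_v = model(xt, t, cond)                        # network's predicted velocity
    return F.mse_loss(pred_v, target_v)
```

At inference, a sample is produced by integrating the learned velocity field from noise toward data over a small number of steps.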
Segmenting Everything Isn't Enough; Now Reconstruct Everything in 3D: SAM 3D Is Here
机器之心· 2025-11-20 02:07
Core Insights
- Meta has launched significant updates with the introduction of SAM 3D and SAM 3, enhancing the understanding of images in 3D [1][2]

Group 1: SAM 3D Overview
- SAM 3D is the latest addition to the SAM series, featuring two models that convert static 2D images into detailed 3D reconstructions [2][5]
- SAM 3D Objects focuses on object and scene reconstruction, while SAM 3D Body specializes in human shape and pose estimation [5][28]
- Meta has made the model weights and inference code for SAM 3D and SAM 3 publicly available [7]

Group 2: SAM 3D Objects
- SAM 3D Objects introduces a novel technical approach for robust and realistic 3D reconstruction and object pose estimation from a single natural image [11]
- The model can generate detailed 3D shapes, textures, and scene layouts from everyday photos, overcoming challenges like small objects and occlusions [12][13]
- Meta has annotated nearly 1 million images, generating approximately 3.14 million 3D meshes, leveraging a scalable data engine for efficient data collection [17][22]

Group 3: SAM 3D Body
- SAM 3D Body addresses the challenge of accurate human 3D pose and shape reconstruction from a single image, even in complex scenarios [28]
- The model supports interactive input, allowing users to guide and control predictions for improved accuracy [29]
- A high-quality training dataset of around 8 million images was created to enhance the model's performance across various 3D benchmarks (reference implementations of the usual pose metrics follow this summary) [31]

Group 4: SAM 3 Capabilities
- SAM 3 introduces promptable concept segmentation, enabling the model to identify and segment instances of specific concepts based on text or example images [35]
- The architecture of SAM 3 builds on previous AI advancements, utilizing the Meta Perception Encoder for enhanced image recognition and object detection [37]
- SAM 3 has achieved a twofold improvement in concept segmentation performance compared to existing models, with rapid inference times even for images with numerous detection targets [39]
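The "various 3D benchmarks" cited for SAM 3D Body are typically scored with MPJPE (mean per-joint position error) and PCK (percentage of correct keypoints), the two figures quoted in the 量子位 summary above (61.7 and 75.4). Minimal reference implementations follow; root-alignment and threshold conventions vary by dataset, so the defaults here are assumptions.

```python
# Standard 3D pose metrics: MPJPE and PCK. Conventions (root joint, threshold,
# units) differ across benchmarks; the defaults below are illustrative.
import numpy as np

def mpjpe(pred, gt, root_idx=0):
    """Mean per-joint position error for (N, J, 3) joint arrays, after aligning
    both skeletons at a root joint; same units as the input (typically mm)."""
    pred = pred - pred[:, root_idx:root_idx + 1]
    gt = gt - gt[:, root_idx:root_idx + 1]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pck(pred, gt, threshold=150.0):
    """Percentage of joints whose error is below `threshold`
    (e.g. PCK@150mm for 3D pose)."""
    err = np.linalg.norm(pred - gt, axis=-1)   # (N, J) per-joint errors
    return float((err < threshold).mean() * 100.0)
```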
SAM 3 Surfaces at ICLR 2026; the Next Step for Segment Anything: Teaching the Model to Understand "Concepts"
具身智能之心· 2025-10-14 00:02
Core Viewpoint
- The article discusses the release of the paper "SAM 3: Segment Anything with Concepts" by Meta, which introduces advancements in the field of computer vision, particularly in promptable concept segmentation [3][5][9].

Summary by Sections

Introduction
- The paper "SAM 3" has gained significant attention, suggesting it is a continuation of Meta's "Segment Anything" series, following the previous versions SAM 1 and SAM 2 [3][5][6].

Key Developments
- SAM 3 introduces a new task called Promptable Concept Segmentation (PCS), allowing users to input text or image examples to predict instance and semantic masks for matching objects while maintaining identity consistency across video frames [9][17].
- The focus is on identifying atomic visual concepts, enabling the model to understand simple noun phrases like "red apple" or "striped cat" for segmentation [9][12].

Performance Improvements
- SAM 3 shows significant performance improvements over SAM 2, achieving at least a 2x enhancement on the new benchmark SA-Co, with a zero-shot mask average precision of 47.0 on the LVIS dataset, surpassing the previous best of 38.5 [13][14].
- The model processes images with over 100 objects in just 30 milliseconds on a single H200 GPU [14].

Methodology
- SAM 3 is built on a dual encoder-decoder transformer architecture, integrating a detector with a tracker and memory module for video applications (a schematic tracking sketch follows this summary) [19].
- A scalable human-machine collaborative data engine was developed, annotating a high-quality training dataset with 4 million unique phrases and 520 million masks [20].

Benchmarking and Results
- SAM 3 outperforms previous models in various benchmarks, including achieving a cgF1 score that is double that of the strongest baseline OWLv2 on the open-vocabulary SA-Co/Gold dataset [28].
- In multiple public benchmarks, SAM 3 consistently exceeds the performance of strong expert baselines, demonstrating its effectiveness in instance segmentation and object detection tasks [27][30].

Conclusion
- The advancements in SAM 3 position it as a leading model in the field of computer vision, particularly in the area of promptable segmentation, showcasing Meta's commitment to pushing the boundaries of AI technology [9][12][19].
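The detector-plus-tracker-plus-memory design mentioned under Methodology can be pictured as the loop below: a concept detector runs on every frame, a memory of previously seen objects is kept, and new detections are matched to existing identities by mask IoU. This is a schematic of the general pattern, not SAM 3's actual tracker; `detect` and `mask_iou` are assumed callables.

```python
# Schematic per-frame detect-then-associate loop with an identity memory.
def track_concept(frames, detect, mask_iou, match_threshold=0.5):
    next_id, tracks = 0, {}                  # identity -> most recent mask
    per_frame_output = []
    for frame in frames:
        assigned = {}
        for det in detect(frame):            # masks for the prompted concept
            best_id, best_iou = None, match_threshold
            for tid, prev_mask in tracks.items():
                if tid in assigned:          # each identity used at most once per frame
                    continue
                iou = mask_iou(det, prev_mask)
                if iou > best_iou:
                    best_id, best_iou = tid, iou
            if best_id is None:              # unseen object -> new identity
                best_id, next_id = next_id, next_id + 1
            tracks[best_id] = det            # update the memory with the latest mask
            assigned[best_id] = det
        per_frame_output.append(assigned)
    return per_frame_output
```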
Mysterious ICLR Paper Revealed: SAM 3 Sees the World Through "Concepts" and Reshapes the Visual AI Paradigm
36Kr· 2025-10-13 23:57
Core Insights
- The upcoming upgrade of the SAM model, SAM 3, focuses on "concept-based segmentation," allowing for segmentation based on semantic concepts rather than just pixels or instances [6][8][15]
- SAM 3 introduces a new standard called Promptable Concept Segmentation (PCS), enabling the model to identify and segment all objects that fit a given concept across various images and videos [8][12][16]
- The model has been trained on a vast dataset, including approximately 4 million unique concept labels, enhancing its ability to understand and segment based on user prompts [6][11][27]

Group 1: SAM 3 Features
- SAM 3 emphasizes interactive refinement of segmentation results, allowing users to provide additional prompts to clarify ambiguous cases (a hypothetical sketch of this loop follows this summary) [8][11]
- The model can track multiple instances of the same concept across different frames in a video, improving its utility in dynamic environments [8][12]
- SAM 3 achieves significant performance improvements, with a zero-shot mask AP of 47.0 on the LVIS dataset, surpassing the previous best of 38.5 [11][28]

Group 2: Data Engine and Training
- A human-AI collaborative data engine has been developed to enhance the training process, allowing the model to learn from its mistakes and improve accuracy [19][22]
- The data engine consists of four phases, starting with human validation and progressing to AI-assisted validation and video annotation [21][25]
- The final dataset, SA-Co, includes 126,000 samples and 214,000 unique phrases, making it one of the largest open-vocabulary segmentation datasets available [28]

Group 3: Concept Segmentation Challenges
- PCS faces challenges due to the vast range of possible concepts, leading to ambiguities that the model must navigate [14]
- To address these ambiguities, SAM 3 employs multi-expert annotations and optimized evaluation protocols to ensure objectivity and accuracy [14][19]
- The model includes a dedicated "ambiguity module" to help it understand and tolerate vague boundaries in concept definitions [14][19]
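The interactive refinement in the first bullet can be pictured as a prompt object that accumulates positive and negative exemplars between model calls, shrinking the ambiguity of the concept each round. A hypothetical sketch; the `segment` callable and the prompt fields are assumptions rather than SAM 3's real interface.

```python
# Hypothetical interactive-refinement loop for an ambiguous concept prompt.
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class ConceptPrompt:
    phrase: str
    positive_exemplars: List[Any] = field(default_factory=list)  # crops to include
    negative_exemplars: List[Any] = field(default_factory=list)  # crops to exclude

def refine(segment, image, prompt, corrections):
    """Apply user corrections one at a time and re-segment after each.
    `corrections` is a list of (is_positive, exemplar_crop) pairs."""
    masks = segment(image, prompt)
    for is_positive, exemplar in corrections:
        bucket = prompt.positive_exemplars if is_positive else prompt.negative_exemplars
        bucket.append(exemplar)
        masks = segment(image, prompt)   # ambiguity shrinks as exemplars accrue
    return masks
```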
Tencent Research Institute AI Digest, 2025-10-14
腾讯研究院· 2025-10-13 17:53
Group 1: OpenAI and Chip Partnerships
- OpenAI has announced a strategic partnership with Broadcom to deploy 10 gigawatts of custom AI chips designed by OpenAI, with deployment starting in the second half of 2026 and completion by the end of 2029 [1]
- This marks OpenAI's third significant deal with a chip giant in a month, following an investment of up to $100 billion from NVIDIA and a 6-gigawatt GPU deployment agreement with AMD [1]
- Sam Altman revealed that both companies have been designing the new chip over the past 18 months, utilizing OpenAI's own models in the design process, and Broadcom's stock price rose by over 10% after the announcement [1]

Group 2: Google Gemini 3.0 Update
- Google is set to release Gemini 3.0 on October 22, showcasing impressive front-end development capabilities that can generate web pages, games, and original music with a single click [2]
- Gemini 3.0 employs an MoE architecture with over a trillion parameters, activating 15-20 billion parameters per query, and can handle context from 1 million to several million tokens, enabling it to process entire books and codebases [2]
- Internal tests indicate that Gemini 3.0 outperformed in front-end tests, including generating 3D pixel art, with a year-on-year growth rate of 46.24% expected by September 2025 [2]

Group 3: LiblibAI 2.0 Upgrade
- LiblibAI 2.0 has integrated over 10 popular video models and numerous image models, allowing users to complete all AI creative tasks within the platform [3]
- The upgrade includes a one-click video effect feature and seamless switching between image generation and video creation, incorporating models like Midjourney V7 and Qwen-Image [3]
- New asset management and AI toolbox features have been added, providing a comprehensive AI experience for both new and existing users [3]

Group 4: Mamba-3 Development
- The third generation of Mamba, Mamba-3, has entered blind review for ICLR 2026, featuring innovations such as trapezoidal-rule discretization, complex state spaces, and a multi-input multi-output design [4][5]
- Mamba-3 introduces complex hidden states to handle periodic patterns and parity checks, significantly enhancing arithmetic intensity to fully utilize GPU capabilities [5]
- It has shown excellent performance in long-context information retrieval tests, with reduced inference latency, making it suitable for long-text processing, real-time interaction, and edge computing applications [5]

Group 5: SAM 3 Concept Segmentation
- The suspected Meta-developed SAM 3 paper has been submitted to ICLR 2026, achieving promptable concept segmentation (PCS) that allows users to segment matching instances using simple noun phrases or image examples [6]
- SAM 3 has demonstrated at least a twofold performance improvement on the SA-Co benchmark, achieving an average precision of 47.0 on the LVIS dataset, surpassing the previous record of 38.5 [6]
- It utilizes a dual encoder-decoder transformer architecture, built on a high-quality training dataset containing 4 million unique phrases and 52 million masks, and processes images with over 100 objects in just 30 milliseconds on a single H200 GPU [6]

Group 6: Google's ReasoningBank Framework
- Google has introduced the ReasoningBank memory framework, which extracts memory items from the successes and failures of agents to form a closed-loop self-evolution system that learns without real labels [7]
- The framework incorporates memory-aware test-time scaling (MaTTS) to generate diverse explorations through parallel and sequential setups, enhancing the synthesis of more universal memories [7]
- ReasoningBank has shown a 34.2% improvement in effectiveness and a 16.0% reduction in interaction steps in benchmark tests such as WebArena, Mind2Web, and SWE-Bench-Verified [7]

Group 7: AI Performance in Astronomy
- Recent studies indicate that GPT-5 and Gemini 2.5 Pro achieved gold-medal results in the International Olympiad on Astronomy and Astrophysics (IOAA), with GPT-5 scoring an average of 84.2% in theoretical exams [8]
- Both models outperformed the best students in theoretical exams, although their accuracy on geometric/spatial problems (49-78%) was notably lower than on physics/mathematics problems (67-91%) [8]
- This highlights AI's strong reasoning capabilities not only in mathematics but also in astronomy and astrophysics, approaching top human-level performance across multiple scientific domains [8]

Group 8: Unitree G1 Robot Developments
- The Unitree G1 robot has demonstrated advanced movements such as aerial flips and kung fu techniques, showcasing its agility and capabilities [10]
- Unitree plans to launch a humanoid robot standing 1.8 meters tall in the second half of this year, having applied for nearly 10 patents related to humanoid robots [10]
- The domestic robotics industry has seen an average growth rate of 50%-100% in the first half of this year, with algorithm upgrades enabling robots to theoretically perform various dance and martial-arts movements [10]

Group 9: Apple AI Glasses
- Bloomberg reports that Apple's smart glasses may run a full version of visionOS when paired with a Mac and switch to a lightweight mobile interface when connected to an iPhone, with a planned release between 2026 and 2027 [11]
- Apple has shifted focus from developing a lighter "Vision Air" headset to smart glasses, directly competing with Meta's Ray-Ban Display [11]
- The first generation of the product will not feature a display but will include audio speakers, cameras, voice control, and potential health functionalities, with plans for a multi-tiered product line in the future [11]

Group 10: Sam Altman's Insights on AI and Work
- Sam Altman stated in a recent interview that AI will change the nature of work but will not eliminate true jobs, suggesting that future work may become easier while human intrinsic motivation remains [12]
- Regarding the development of GPT-6, the focus will be on creating smarter models with longer context and better memory capabilities, with Codex already capable of completing full-day tasks [12]
- OpenAI currently has 800 million weekly active users, and Altman believes that voice will not be the ultimate form of AI interaction, with the team working on a new voice-interaction device that will not be revealed in the short term [12]
SAM 3 Surfaces at ICLR 2026; the Next Step for Segment Anything: Teaching the Model to Understand "Concepts"
机器之心· 2025-10-13 04:21
Core Insights
- The article discusses the release of a new paper titled "SAM 3: Segment Anything with Concepts," which is believed to be a continuation of Meta's "Segment Anything" series, following SAM 1 and SAM 2 [1][3][4].

Group 1: Overview of SAM 3
- SAM 3 introduces a new task called Promptable Concept Segmentation (PCS), allowing users to input text or image examples to predict instance and semantic masks for matching objects while maintaining identity consistency across video frames [8][12].
- The model focuses on identifying atomic visual concepts, enabling it to understand simple noun phrases like "red apple" or "striped cat" for segmentation tasks [8][12].
- SAM 3 improves upon its predecessors by enhancing performance in promptable visual segmentation and establishing new standards for PCS [18].

Group 2: Performance Metrics
- SAM 3 shows significant performance improvements, achieving at least a 2x enhancement on the newly proposed SA-Co benchmark compared to previous systems [13].
- In the LVIS dataset, SAM 3 achieved a zero-shot mask average precision of 47.0, surpassing the previous best of 38.5 [13].
- The model processes images with over 100 objects in just 30 milliseconds on a single H200 GPU [14].

Group 3: Methodology and Data
- SAM 3 employs a dual encoder-decoder transformer architecture, integrating a detector with a tracker and memory module for video applications [20].
- The research developed a scalable human-machine collaborative data engine, annotating a high-quality training dataset with 4 million unique phrases and 520 million masks [21].
- The PCS benchmark includes 124K images and 1.7K videos with 214K unique concepts, significantly expanding the concept count compared to existing benchmarks [25].

Group 4: Comparative Analysis
- SAM 3 outperforms previous models in various tasks, including instance segmentation, box detection, and semantic segmentation across multiple datasets [27][28].
- In open-vocabulary semantic segmentation experiments, SAM 3 exceeded the performance of strong baseline models [29].
- The model also demonstrated superior object counting accuracy and segmentation capabilities compared to other models (a minimal counting sketch follows this summary) [33].
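The object-counting results in Group 4 follow directly from instance segmentation: the count is simply the number of distinct predicted instances for the prompted concept. A minimal sketch with IoU-based deduplication so overlapping duplicates are not double-counted; both thresholds are illustrative assumptions.

```python
# Count instances from predicted masks with a simple mask-level NMS.
import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def count_instances(masks, scores, score_thr=0.5, iou_thr=0.7):
    """masks: list of HxW boolean arrays; scores: matching confidences."""
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1]           # highest confidence first
    kept = []
    for i in order:
        if scores[i] < score_thr:
            break                              # remaining masks are too weak
        if all(mask_iou(masks[i], masks[j]) < iou_thr for j in kept):
            kept.append(i)                     # a genuinely new instance
    return len(kept)
```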
Meta's "Segment Anything" 3.0 Revealed: Semantic Segmentation Gains Concept Prompts; It's Fun to Play With and About to Blow Up
36Kr· 2025-10-13 03:52
Core Insights
- The article discusses the introduction of SAM 3, a third-generation segmentation model that can understand natural-language prompts for image and video segmentation tasks [1][3][5].

Group 1: Model Capabilities
- SAM 3 can segment images and videos based on user-defined phrases, allowing for more interactive and intuitive segmentation tasks [3][6].
- The model processes images containing over 100 objects in just 30 milliseconds, demonstrating near real-time capabilities for video processing [5][21].
- SAM 3 introduces a new task paradigm called Promptable Concept Segmentation (PCS), which allows for multi-instance segmentation based on various input prompts [6][7].

Group 2: Technical Innovations
- The architecture of SAM 3 includes a new detection module based on DETR (detection transformer), which separates object recognition and localization tasks to enhance detection accuracy (a minimal sketch of this split follows this summary) [11].
- A scalable data engine was developed to create a training dataset with 4 million unique concept labels and 52 million validated masks, improving the model's performance [12].
- The SA-Co benchmark was introduced to evaluate the model's performance in open-vocabulary segmentation tasks, significantly expanding the concept coverage compared to existing benchmarks [13].

Group 3: Performance Metrics
- SAM 3 achieved a zero-shot mask AP of 47.0 on the LVIS dataset, surpassing the previous state of the art (SOTA) of 38.5 [16].
- In the new SA-Co benchmark, SAM 3's performance is at least twice as strong as baseline methods [16].
- The model also outperformed SAM 2 in video segmentation tasks, indicating significant improvements in performance [18].

Group 4: Future Directions
- Researchers are exploring the combination of SAM 3 with multimodal large language models (MLLMs) to tackle more complex segmentation tasks, such as identifying specific scenarios in images [19].
- Despite its advancements, SAM 3 still faces challenges in generalizing zero-shot to specialized fields like medical imaging and thermal imaging [21].
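The separation of recognition and localization noted in Group 2 is the standard DETR recipe: learned object queries attend to image tokens, then one head scores whether each query matches the prompted concept while a separate head regresses its box. A minimal generic sketch, not Meta's implementation; layer sizes and names are illustrative.

```python
# Tiny DETR-style head: separate "is this the concept?" and "where is it?" outputs.
import torch.nn as nn

class TinyDetrHead(nn.Module):
    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)         # learned object queries
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.recognition_head = nn.Linear(dim, 1)             # concept match score per query
        self.box_head = nn.Linear(dim, 4)                     # (cx, cy, w, h), normalised

    def forward(self, image_tokens):                          # (B, N, dim) encoder features
        q = self.queries.weight.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        decoded = self.decoder(q, image_tokens)               # queries attend to the image
        return (self.recognition_head(decoded).sigmoid(),     # recognition ("what")
                self.box_head(decoded).sigmoid())             # localization ("where")
```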