ICRA 2026 | Robots Can Now Tidy Your Wardrobe! New Work from Dong Hao's Team at Peking University Tackles Precise Grasping of Garments from Cluttered Piles
机器之心· 2026-03-28 02:30
Hello everyone, I am 李铭乐洋, a first-year undergraduate in the Turing Class at Peking University, and I am glad to present our team's latest work, GarmentPile++. The project was led by Dong Hao (董豪), tenured associate professor at Peking University and chief scientist at 上纬启元, and has just been accepted at ICRA 2026, a top robotics conference. It also recently received a Top-10 Demo Project award in "EAI-100: 100 Representative Achievements and Figures in Embodied AI, 2025".

In real-world settings, garments are usually piled up in clutter, so intelligent garment retrieval becomes necessary. We propose GarmentPile++, which is more efficient than the prior work GarmentPile and additionally supports retrieving the specific garment described by a language instruction, providing a solid upstream foundation for downstream single-garment manipulation.

Garment manipulation is a key capability for household service robots. However, garments in real environments typically exist as cluttered piles. Because of their high deformability, nearly infinite state space, and complex dynamics, precisely retrieving and grasping a specific garment from such a pile is a challenging task.

Recent work such as GarmentPile attempts to address this problem, but existing methods rely mainly on a single visual affordance, cannot understand language instructions, and are mostly limited to single-arm operation, making it hard to handle large or elongated garments. To address these pain points, we propose GarmentPile++. This is ...
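The summary above describes retrieval driven by a language instruction on top of a visual affordance map. Below is a minimal sketch of how language-conditioned affordance scoring could be wired together; the module names, the feature fusion scheme, and the CLIP-style text embedding are illustrative assumptions, not the GarmentPile++ implementation.

```python
# Illustrative sketch: language-conditioned garment affordance scoring.
# All modules and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn

class LanguageConditionedAffordance(nn.Module):
    def __init__(self, text_dim=512, feat_dim=64):
        super().__init__()
        # Per-pixel visual features from an RGB-D observation of the pile.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(4, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Project the instruction embedding into the visual feature space.
        self.text_proj = nn.Linear(text_dim, feat_dim)

    def forward(self, rgbd, text_emb):
        # rgbd: (B, 4, H, W); text_emb: (B, text_dim) from any text encoder.
        feats = self.visual_encoder(rgbd)                 # (B, C, H, W)
        query = self.text_proj(text_emb)[:, :, None, None]
        # Dot-product similarity -> per-pixel affordance for the described garment.
        scores = (feats * query).sum(dim=1)               # (B, H, W)
        return torch.sigmoid(scores)

# Usage: pick the highest-scoring pixel as a grasp candidate.
model = LanguageConditionedAffordance()
rgbd = torch.randn(1, 4, 128, 128)
text_emb = torch.randn(1, 512)        # e.g. an embedding of "the blue shirt"
affordance = model(rgbd, text_emb)
grasp_idx = affordance.flatten(1).argmax(dim=1)
```

In practice such a per-pixel score map would feed the downstream single-garment manipulation policy mentioned above; dual-arm handling of large garments would require selecting two grasp points rather than one.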
TPAMI | DC-SAM: Breaking SAM's Interaction Bottleneck with a Cycle-Consistency-Based Method for In-Context Segmentation of Images and Videos
机器之心· 2026-01-20 04:51
In-context segmentation aims to guide a model to automatically segment specific targets from reference examples. Although SAM's excellent zero-shot generalization provides a strong foundation for this, applying it here is still constrained by the need to construct prompts (such as points or boxes). This requirement not only limits the automation of batch inference, but also makes it hard for the model to maintain spatio-temporal consistency on complex, continuous video.

In the IEEE TPAMI journal paper "DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency", Beijing University of Posts and Telecommunications, together with Nanyang Technological University and other institutions, not only establishes DC-SAM, a unified and efficient framework for in-context segmentation of images and videos, but also builds IC-VOS, the first benchmark for in-context video object segmentation. The team proposes a prompt-tuning-based "cycle consistency" mechanism: positive and negative branches work together with cycle-consistent attention and a Mask-Tube strategy to adapt both SAM and SAM2 to image and video in-context segmentation in a unified, efficient way.

Experiments show that DC-SAM achieves SOTA performance on multiple benchmarks: 55.5 mIoU on COCO-20 and, on Pascal-5, ...
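The summary describes deriving SAM prompts from a reference image/mask pair and validating them with a cycle-consistency constraint. The sketch below illustrates that general idea in isolation: a prompt point is proposed on the query image by feature matching, then mapped back to the reference and required to land inside the reference mask. The feature matching rule and the cycle check are simplifying assumptions, not the DC-SAM architecture.

```python
# Illustrative cycle-consistency check for in-context prompt generation.
# Feature extraction and matching are deliberately simplified assumptions.
import torch
import torch.nn.functional as F

def masked_prototype(feats, mask):
    # feats: (C, H, W), mask: (H, W) binary -> average feature of the target region.
    m = mask.float()
    return (feats * m).sum(dim=(1, 2)) / m.sum().clamp(min=1.0)

def best_match(feats, query_vec):
    # Location (y, x) whose feature is most similar to query_vec.
    C, H, W = feats.shape
    sim = F.cosine_similarity(feats.view(C, -1), query_vec[:, None], dim=0)
    idx = sim.argmax().item()
    return idx // W, idx % W

def cycle_consistent_prompt(ref_feats, ref_mask, qry_feats):
    """Propose a point prompt on the query image and verify it with a cycle check."""
    proto = masked_prototype(ref_feats, ref_mask)                  # forward: ref -> query
    qy, qx = best_match(qry_feats, proto)
    back_y, back_x = best_match(ref_feats, qry_feats[:, qy, qx])   # backward: query -> ref
    consistent = bool(ref_mask[back_y, back_x])                    # must land in the ref mask
    return (qy, qx), consistent

# Toy usage with random features; real features would come from SAM's image encoder.
C, H, W = 32, 64, 64
ref_feats, qry_feats = torch.randn(C, H, W), torch.randn(C, H, W)
ref_mask = torch.zeros(H, W, dtype=torch.bool)
ref_mask[20:40, 20:40] = True
point, ok = cycle_consistent_prompt(ref_feats, ref_mask, qry_feats)
print(point, ok)
```

A point that fails the backward check would be treated as an unreliable prompt; DC-SAM's actual mechanism operates on learned prompts with positive/negative branches rather than raw pixel matches.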
From SAM1 to SAM3: What Did Meta Do?
自动驾驶之心· 2025-12-06 03:04
Core Insights
- Meta has made significant advancements in AI, particularly in visual models, with the release of SAM1, SAM2, and SAM3, marking a new era in computer vision technology [1][25].

Summary by Sections

SAM1 to SAM3 Evolution
- SAM1 introduced Promptable Visual Segmentation (PVS), allowing image segmentation through simple prompts like clicks or semantic hints [1] (see the point-prompt sketch after this summary).
- SAM2 optimized the architecture for better video segmentation and dynamic scene support, enhancing stability and accuracy [3].
- SAM3 achieved unprecedented accuracy with multi-modal support, enabling segmentation through voice, text, and images, and introduced Promptable Concept Segmentation (PCS) for complex object recognition [3][4].

Technical Specifications
- SAM1 had a smaller model size suitable for real-time inference, while SAM2 improved efficiency and SAM3 enhanced computational capability for complex tasks [4].
- SAM3 supports real-time video and image segmentation with multi-object tracking, showcasing its advanced capabilities [4].
- The model allows for long-context semantic reasoning, improving video scene analysis [4].

Concept Segmentation
- SAM3 can identify and segment objects based on user-defined concepts, such as "striped cat," demonstrating its flexibility and precision [7][11].
- The model uses positive and negative examples to refine segmentation results, improving accuracy [10].

Performance Metrics
- SAM3 outperformed previous models in various segmentation tasks, achieving high scores across datasets such as LVIS and COCO [21][23].
- The model's zero-shot performance was notable, handling tasks effectively without extensive training data [29].

Multi-modal Capabilities
- SAM3's integration with MLLMs (multimodal large language models) allows for complex text queries, enhancing its object segmentation capabilities [21][29].
- The model's ability to combine text and image inputs significantly improves segmentation outcomes, showcasing its strength in multi-modal tasks [23].

Conclusion
- The advancements from SAM1 to SAM3 reflect Meta's strategic push in the visual AI domain, reshaping applications in everyday life, including autonomous driving and video surveillance [25][26].
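For concreteness, here is a minimal point-prompt example using the public `segment_anything` package released with SAM1. The checkpoint path assumes the standard ViT-H weights have been downloaded; SAM3's text/concept prompting is not shown, since its public interface may differ from this sketch.

```python
# Minimal point-prompt segmentation with the original SAM (SAM1) release.
# Assumes `pip install segment-anything` and a downloaded ViT-H checkpoint.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)    # stand-in for a real RGB image
predictor.set_image(image)

# One foreground click (label 1) on the object of interest.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,                          # return several candidate masks
)
best_mask = masks[scores.argmax()]                  # keep the highest-scoring mask
print(best_mask.shape, scores)
```

Concept-level prompting in SAM3 replaces the click with a noun phrase (e.g. "striped cat") and returns masks for every matching instance, which is what distinguishes PCS from the single-instance PVS interaction above.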
Segmentation, Recognition, and Narration in One Model! 3B Parameters Set a New Visual Understanding SOTA, Adapting to Both Images and Videos
量子位· 2025-06-14 08:33
Core Viewpoint
- The article introduces PAM (Perceive Anything Model), a powerful model that integrates segmentation, recognition, explanation, and description capabilities for images and videos, providing rich semantic information alongside segmentation masks [1][8].

Group 1: Model Capabilities
- PAM retains the segmentation and tracking capabilities of SAM2, allowing it to segment any object in images and videos while also providing detailed semantic information about the objects [5][8].
- Users can interact with PAM by clicking or dragging a rectangle around an object, after which the model outputs the object's category, explanation, and detailed description simultaneously [11].
- For short videos, PAM can track and segment selected objects while providing event descriptions related to those objects [13].

Group 2: Technical Specifications
- PAM is trained on a large-scale, high-quality dataset consisting of 1.5 million image regions and 600,000 video regions, achieving state-of-the-art (SOTA) performance with only 3 billion parameters [2][18].
- The model employs a framework that combines SAM2's segmentation capabilities with a Semantic Perceiver and a large language model (LLM) to efficiently translate visual features into multimodal tokens [17] (a schematic sketch follows this summary).
- PAM's architecture allows for parallel output of segmentation masks and semantic information, ensuring both efficiency and performance [17][18].

Group 3: Performance Metrics
- PAM-3B outperforms previous best models by over 3.2% on the PACO benchmark and surpasses the current SOTA model DAM-8B in semantic IoU on the LVIS benchmark [25][26].
- On various benchmarks such as ImageCaption and VideoCaption, PAM demonstrates superior performance at a smaller parameter scale than larger models [28].
- The model's innovative streaming video subtitle capability maintains high semantic consistency across continuous events, showcasing its practical application potential [30].
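The Group 2 items describe a design in which region features from a SAM2-style backbone are turned into tokens for an LLM while the mask branch runs in parallel. The sketch below is only a schematic of that kind of wiring; every module (the perceiver, the LLM stub, the dimensions) is a placeholder assumption, not PAM's released implementation.

```python
# Schematic "segment + describe" pipeline in the spirit described above:
# region features feed a small perceiver that produces LLM tokens, while a
# mask head runs in parallel. All modules and dimensions are placeholders.
import torch
import torch.nn as nn

class SemanticPerceiver(nn.Module):
    """Compress dense region features into a few tokens for the language model."""
    def __init__(self, feat_dim=256, llm_dim=2048, num_tokens=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, region_feats):
        # region_feats: (B, N, feat_dim) features pooled from the segmented region.
        B = region_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        tokens, _ = self.attn(q, region_feats, region_feats)
        return self.proj(tokens)                      # (B, num_tokens, llm_dim)

class PerceiveAndDescribe(nn.Module):
    def __init__(self):
        super().__init__()
        self.mask_head = nn.Linear(256, 1)            # stand-in for a mask decoder
        self.perceiver = SemanticPerceiver()
        self.llm = nn.Identity()                      # stand-in for a 3B-parameter LLM

    def forward(self, region_feats):
        mask_logits = self.mask_head(region_feats)    # branch 1: mask prediction
        text_tokens = self.llm(self.perceiver(region_feats))  # branch 2: semantics
        return mask_logits, text_tokens

model = PerceiveAndDescribe()
feats = torch.randn(1, 64, 256)                       # 64 pooled feature vectors for one region
mask_logits, text_tokens = model(feats)
print(mask_logits.shape, text_tokens.shape)           # (1, 64, 1) (1, 8, 2048)
```

The two branches returning together mirrors the "parallel output of segmentation masks and semantic information" noted above; in a real system the LLM stub would decode the tokens into the category, explanation, and description text.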