Segmentation, recognition, and narration, all in one model! 3B parameters set a new visual-understanding SOTA, fully adaptable to both images and videos
量子位 · 2025-06-14 08:33
Core Viewpoint - The article introduces PAM (Perceive Anything Model), a single model that combines segmentation, recognition, explanation, and description for images, videos, and long videos, outputting rich semantic information alongside segmentation masks in one interaction [1][8].

Group 1: Model Capabilities
- PAM retains the segmentation and tracking capabilities of SAM2, so it can segment any object in an image or video while also supplying detailed semantic information about it [5][8].
- Users interact by clicking on an object or dragging a rectangle around it; the model then outputs the object's category, an explanation, and a detailed description simultaneously (see the interaction sketch after this summary) [11].
- For short videos, PAM tracks and segments the selected object while describing the events it takes part in; for long videos, it dynamically outputs streaming descriptions as events change, much like real-time subtitles [13][14].

Group 2: Architecture and Training
- The framework couples SAM2's segmentation capabilities with a Semantic Perceiver and a large language model (LLM); the Semantic Perceiver efficiently translates visual features into multimodal tokens for the LLM, and the architecture emits segmentation masks and semantic information in parallel, balancing efficiency and performance (a structural sketch follows below) [17][18].
- PAM is trained on a large-scale, high-quality dataset of 1.5 million image regions and 600,000 video regions, whose multi-dimensional semantic annotations cover classification, explanation, description, and temporal events; this is what lets a model with only 3 billion parameters reach state-of-the-art (SOTA) performance [2][18][21][24].

Group 3: Performance Metrics
- PAM-3B beats the previous best model by over 3.2% on the PACO benchmark and surpasses the current SOTA model DAM-8B in semantic IoU on the LVIS benchmark (the metric is sketched below) [25][26].
- Across benchmarks such as ImageCaption and VideoCaption, PAM outperforms considerably larger models despite its smaller parameter count [28].

Group 4: Innovative Features
- PAM's region-level streaming video subtitles maintain high semantic consistency across continuous events, pointing to strong practical application potential (see the streaming-caption sketch at the end) [30].
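A minimal sketch of what the click-or-box prompting described above could look like as an API. Everything here (the class, method, and field names, and the stub behavior) is an illustrative assumption, not the released PAM interface:

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np

# Hypothetical sketch of a PAM-style interaction: a click or a dragged box
# selects a region, and the model returns a mask plus three levels of
# semantics. Names and shapes are assumptions, not the actual released API.

@dataclass
class RegionOutput:
    mask: np.ndarray        # binary segmentation mask, shape (H, W)
    category: str           # object category, e.g. "golden retriever"
    explanation: str        # short phrase justifying/disambiguating the label
    description: str        # detailed free-form description of the region

class PerceiveAnythingStub:
    """Stand-in for the real model; returns canned outputs."""
    def perceive(self, image: np.ndarray,
                 point: Optional[Tuple[int, int]] = None,
                 box: Optional[Tuple[int, int, int, int]] = None) -> RegionOutput:
        assert (point is None) != (box is None), "give exactly one prompt"
        h, w = image.shape[:2]
        return RegionOutput(mask=np.zeros((h, w), dtype=bool),
                            category="<category>",
                            explanation="<why this label fits>",
                            description="<detailed description>")

image = np.zeros((480, 640, 3), dtype=np.uint8)
out = PerceiveAnythingStub().perceive(image, point=(320, 240))  # single click
print(out.category, "|", out.explanation)
```

The key point the sketch captures is that one prompt yields the mask and all three semantic outputs at once, rather than requiring separate segmentation and captioning calls.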
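The described design (SAM2 features feeding a mask decoder and, via a Semantic Perceiver, an LLM, so both branches run in parallel) could be wired roughly as below. All dimensions, module choices, and names are assumptions for illustration, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

# Rough structural sketch: SAM2 visual features drive a promptable mask
# decoder and, through a "Semantic Perceiver", a ~3B-parameter LLM, so
# segmentation masks and semantic tokens are produced in parallel.
# Every dimension and module choice here is an assumed placeholder.

class SemanticPerceiver(nn.Module):
    """Maps visual features to a fixed set of multimodal tokens for the LLM."""
    def __init__(self, vis_dim=256, llm_dim=2048, num_queries=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats):                       # (B, N, vis_dim)
        q = self.queries.expand(vis_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, vis_feats, vis_feats)  # cross-attend to features
        return self.proj(tokens)                        # (B, num_queries, llm_dim)

class PAMSketch(nn.Module):
    def __init__(self, sam2_encoder, mask_decoder, llm):
        super().__init__()
        self.encoder = sam2_encoder        # SAM2 image/video encoder
        self.mask_decoder = mask_decoder   # SAM2-style promptable mask head
        self.perceiver = SemanticPerceiver()
        self.llm = llm                     # language model producing text

    def forward(self, pixels, prompt):
        feats = self.encoder(pixels)
        mask = self.mask_decoder(feats, prompt)          # segmentation branch
        text_tokens = self.llm(self.perceiver(feats))    # semantic branch
        return mask, text_tokens                         # emitted in parallel
```

The parallel branches are what the article credits for the efficiency: the LLM consumes a small, fixed number of perceiver tokens instead of the full visual feature map.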
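The article does not define semantic IoU, but one common definition for open-vocabulary region recognition on LVIS/PACO treats the predicted and ground-truth category names as word sets and takes their overlap; a sketch under that assumption:

```python
# Sketch of semantic IoU as word-set overlap between a predicted and a
# ground-truth category name -- an assumed (common) definition, not one
# stated in the article.

def semantic_iou(pred: str, gt: str) -> float:
    pred_words = set(pred.lower().split())
    gt_words = set(gt.lower().split())
    if not pred_words and not gt_words:
        return 1.0
    return len(pred_words & gt_words) / len(pred_words | gt_words)

print(semantic_iou("golden retriever", "retriever"))  # 0.5: partial credit
print(semantic_iou("coffee mug", "coffee mug"))       # 1.0: exact match
```

Under this definition, a model earns partial credit for near-miss labels, which makes it a softer metric than exact-match accuracy for open-vocabulary categories.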
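The region-level streaming subtitle behavior (track the selected object and emit a new description only when its event changes) might be driven by a loop like the following; the `track_and_describe` interface and its return fields are hypothetical:

```python
# Hypothetical event-driven loop for region-level streaming subtitles: keep
# tracking the selected object frame by frame and emit a fresh caption only
# at event boundaries. The model interface (track_and_describe, .caption)
# is an assumption for illustration, not the actual PAM API.

def stream_region_subtitles(model, frames, prompt):
    last_caption = None
    for t, frame in enumerate(frames):
        result = model.track_and_describe(frame, prompt)   # mask + event caption
        if result.caption != last_caption:                 # event changed
            yield t, result.caption                        # emit like a subtitle
            last_caption = result.caption

# Usage sketch:
# for timestamp, caption in stream_region_subtitles(pam, video_frames, click):
#     print(f"[frame {timestamp}] {caption}")
```

Emitting captions only at event boundaries, rather than per frame, is what keeps the stream readable and semantically consistent across a long video.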