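Segmentation, recognition, and narration in one model! 3B parameters set a new SOTA in visual understanding, with full support for images and videos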

Core Viewpoint
- PAM (Perceive Anything Model) performs segmentation, recognition, explanation, and description in a single interaction. It supports images, videos, and long videos, and outputs text and masks simultaneously [1][8].

Group 1: Model Capabilities
- PAM retains the segmentation and tracking capabilities of SAM2 and adds rich semantic information, so users can obtain detailed descriptions of selected objects in images and videos with a single click [5][8].
- For images, PAM outputs the category, explanation, and detailed description of a selected object [11].
- For short videos, PAM tracks and segments the selected object while describing the events it is involved in. For long videos, it outputs streaming descriptions that update as events change, similar to real-time subtitles [13][14].

Group 2: Training and Data
- The PAM team constructed a large-scale, high-quality training dataset comprising 1.5 million image regions and 600,000 video regions, which enables the model to reach state-of-the-art performance with only 3 billion parameters [2][21].
- The dataset carries multi-dimensional semantic annotations covering classification, explanation, description, and temporal events, supporting precise object localization and rich semantic output [21][24].

Group 3: Performance Metrics
- PAM-3B outperforms the previous best models by over 3.2% on the PACO benchmark and surpasses the current state-of-the-art model DAM-8B in semantic IoU on the LVIS benchmark [25][26].
- On benchmarks such as ImageCaption and VideoCaption, PAM outperforms larger models despite its smaller parameter count [28].

Group 4: Innovative Features
- PAM introduces region-level streaming video captioning, which maintains high semantic consistency across continuous events and shows significant potential for practical applications [30].