分割/识别/解说一个模型搞定!3B参数刷新视觉理解SOTA,图像视频全适配
(Segmentation, recognition, and narration, all in one model! 3B parameters set a new visual-understanding SOTA, covering both images and videos)
量子位 (QbitAI) · 2025-06-14 08:33

Core Viewpoint
- The article introduces PAM (Perceive Anything Model), which integrates segmentation, recognition, explanation, and description for images and videos, producing rich semantic information alongside segmentation masks [1][8].

Group 1: Model Capabilities
- PAM retains SAM2's segmentation and tracking capabilities, so it can segment any object in images and videos while also providing detailed semantic information about each object [5][8].
- Users interact with PAM by clicking on an object or dragging a rectangle around it; the model then simultaneously outputs the object's category, an explanation, and a detailed description [11].
- For short videos, PAM tracks and segments the selected objects while generating descriptions of the events involving them [13].

Group 2: Technical Specifications
- PAM is trained on a large-scale, high-quality dataset of 1.5 million image regions and 600,000 video regions, achieving state-of-the-art (SOTA) performance with only 3 billion parameters [2][18].
- The framework couples SAM2's segmentation backbone with a Semantic Perceiver and a large language model (LLM), efficiently translating visual features into multimodal tokens [17].
- The architecture outputs segmentation masks and semantic information in parallel, balancing efficiency and performance [17][18].

Group 3: Performance Metrics
- PAM-3B outperforms the previous best model by over 3.2% on the PACO benchmark and surpasses the current SOTA model DAM-8B in semantic IoU on the LVIS benchmark [25][26].
- Across image-captioning and video-captioning benchmarks, PAM matches or exceeds much larger models despite its smaller parameter count [28].
- Its streaming video-captioning capability maintains high semantic consistency across continuous events, demonstrating its practical application potential [30].
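The architecture described above (SAM2 visual features routed through a Semantic Perceiver into an LLM, with segmentation masks decoded in parallel) can be sketched roughly as follows. This is a minimal illustration only: every class name, function name, and dimension below is a hypothetical assumption, not PAM's actual implementation or API.

```python
import numpy as np

# Hypothetical sketch of a PAM-style pipeline; names and shapes are
# illustrative assumptions, not the real PAM code.

rng = np.random.default_rng(0)

class SemanticPerceiver:
    """Stand-in for the Semantic Perceiver: pools spatial vision
    features with learned queries, then projects the result into the
    LLM's token-embedding space as multimodal tokens."""

    def __init__(self, vision_dim: int, llm_dim: int, num_tokens: int):
        self.queries = rng.standard_normal((num_tokens, vision_dim))
        self.proj = rng.standard_normal((vision_dim, llm_dim)) * 0.02

    def __call__(self, features: np.ndarray) -> np.ndarray:
        # Cross-attention-like pooling: each query softmax-attends over
        # the spatial features, then pooled vectors are projected.
        attn = self.queries @ features.T                     # (num_tokens, hw)
        attn = np.exp(attn - attn.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        pooled = attn @ features                             # (num_tokens, vision_dim)
        return pooled @ self.proj                            # (num_tokens, llm_dim)

def pam_style_forward(image_features: np.ndarray):
    """Two parallel heads over shared SAM2-style features:
    a binary mask (segmentation branch) and multimodal tokens
    destined for the LLM (semantic branch)."""
    # Segmentation branch: a toy per-location linear mask head.
    mask_logits = image_features @ rng.standard_normal(image_features.shape[1])
    mask = mask_logits > 0
    # Semantic branch: features -> multimodal tokens for the LLM.
    perceiver = SemanticPerceiver(vision_dim=image_features.shape[1],
                                  llm_dim=64, num_tokens=8)
    tokens = perceiver(image_features)
    return mask, tokens

# A 14x14 patch grid of 32-dim features standing in for SAM2 output.
features = rng.standard_normal((196, 32))
mask, tokens = pam_style_forward(features)
print(mask.shape, tokens.shape)  # (196,) (8, 64)
```

The point of the sketch is the parallelism: the mask head and the token head both read the same shared visual features, so segmentation and semantic description come out in one pass rather than one feeding the other.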