Describe Anything Model (DAM)
PixelRefer: Moving AI from "looking at the big picture" to "understanding every object"
机器之心 · 2025-11-10 23:47
Core Insights
- The article discusses the limitations of current multimodal large language models (MLLMs) in achieving the fine-grained, object-level understanding required by real-world applications such as autonomous driving and medical imaging, highlighting the need for a more detailed visual understanding framework [2][38]
- PixelRefer is introduced as a unified spatio-temporal understanding framework capable of fine-grained visual referencing and reasoning at arbitrary granularity, outperforming existing models on several benchmarks [2][38]

Model Overview
- PixelRefer feeds global visual tokens, pixel-level region tokens, and text tokens jointly into a large language model (LLM), preserving both scene-level context and object-level reasoning (see the sketch after this summary) [16][22]
- Its lightweight variant, PixelRefer-Lite, runs inference about 4x faster and uses roughly half the memory of existing models such as DAM-3B [2][33]

Methodology
- The authors propose two frameworks for pixel-level fine-grained understanding, a Vision-Object framework and an Object-Only framework, emphasizing the importance of high-quality pixel-level object representation [15][22]
- A Scale-Adaptive Object Tokenizer (SAOT) is introduced to generate precise yet compact object representations, handling the detail loss that affects both very small and very large objects [17][16]

Performance Metrics
- PixelRefer achieves state-of-the-art (SOTA) performance across image understanding benchmarks, including PACO and DLC-Bench, with notable advantages in reasoning scenarios [28][30]
- On video pixel-level understanding benchmarks, PixelRefer is likewise superior, particularly in region-level video captioning and question answering [29][31]

Applications and Future Directions
- The advances in PixelRefer mark a shift toward understanding the dynamic details of the world, with potential applications in autonomous driving, medical imaging, intelligent video editing, and multimodal dialogue systems [38][40]
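The Vision-Object design described above can be pictured as a single token sequence handed to the LLM: global patch tokens for scene context, a few compact object tokens produced by a scale-adaptive tokenizer, and the embedded text instruction. The snippet below is a minimal, hypothetical PyTorch sketch of that layout; SimpleObjectTokenizer, the toy dimensions, and the mask are illustrative assumptions, not the released PixelRefer code.

```python
# Minimal sketch: global image tokens + compact per-object tokens + text tokens
# concatenated into one LLM input sequence. All names and sizes are assumptions.
import torch
import torch.nn as nn


class SimpleObjectTokenizer(nn.Module):
    """Toy stand-in for a scale-adaptive object tokenizer: pools the features
    inside an object mask into a fixed, compact set of region tokens."""

    def __init__(self, dim: int, tokens_per_object: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(tokens_per_object, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feat_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feat_map: (H*W, dim) patch features; mask: (H*W,) boolean object mask
        obj_feats = feat_map[mask].unsqueeze(0)                 # (1, N_obj, dim)
        q = self.queries.unsqueeze(0)                           # (1, T, dim)
        region_tokens, _ = self.attn(q, obj_feats, obj_feats)   # (1, T, dim)
        return region_tokens.squeeze(0)


dim, hw = 256, 24 * 24
feat_map = torch.randn(hw, dim)             # global patch features from a vision encoder
mask = torch.zeros(hw, dtype=torch.bool)
mask[100:140] = True                        # hypothetical object region
tokenizer = SimpleObjectTokenizer(dim)

global_tokens = feat_map                    # scene-level context
object_tokens = tokenizer(feat_map, mask)   # compact object-level tokens
text_tokens = torch.randn(12, dim)          # embedded instruction, e.g. "describe <region>"

llm_input = torch.cat([global_tokens, object_tokens, text_tokens], dim=0)
print(llm_input.shape)                      # (576 + 4 + 12, 256)
```

Keeping only a handful of tokens per object, rather than every masked patch, is what makes the lightweight Object-Only variant plausible as a faster path.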
Beyond NVIDIA's Describe Anything! The Chinese Academy of Sciences and ByteDance jointly propose "GAR", complementing DeepSeek-OCR
量子位 · 2025-10-28 05:12
Core Insights
- The article discusses the "vision as context compression" approach proposed by DeepSeek-OCR, which uses OCR capability to compress documents through images [1]
- A collaboration between the Chinese Academy of Sciences and ByteDance introduces "Grasp Any Region" (GAR), which explores the potential of natural images as a medium for text compression [2]
- GAR's precise region-captioning capability is highlighted as a possible pathway to building dense captions for natural images [4]

Summary by Sections

GAR Capabilities
- GAR has three main abilities: accurately describing user-specified regions, modeling relationships between multiple regions, and performing complex compositional reasoning [5][7]
- Users can provide various visual prompts and instructions to obtain a precise understanding of specific regions [9][10]

Importance of Region MLLMs
- Region MLLMs differ from traditional MLLMs by enabling fine-grained, interactive understanding of image and video content [8]
- The article notes that full-image captions are hard to evaluate, whereas region captions can be assessed objectively against attributes such as color, texture, shape, and material [12]

Trade-off Between Local and Global Information
- Region MLLMs face a dilemma in balancing local detail against global context [15]
- Examples illustrate how GAR outperforms models such as DAM in accurately identifying and describing specified regions [18][19]

Model Design and Mechanism
- GAR's design follows the principle of achieving fine-grained understanding while retaining global context [39]
- A lightweight prompt-encoding mechanism and RoI-Aligned Feature Replay enable high-fidelity feature extraction from the specified regions (a sketch of this idea follows the summary) [46][49]

Data Pipeline and Training
- Training proceeds in multiple stages to strengthen recognition and to support multi-region associative reasoning [57][59][61]
- GAR-Bench is created to systematically evaluate the region-level understanding capabilities of multimodal large language models (MLLMs) [64]

Performance Evaluation
- GAR models achieve high scores across benchmarks, in both single-region and multi-region understanding tasks [71][74]
- The results show GAR's effectiveness at generating rich, accurate, and detailed local descriptions, establishing it as a state-of-the-art solution [77]

Zero-shot Transfer to Video Tasks
- GAR's capabilities extend to video, with strong zero-shot performance that even surpasses models trained specifically for video [79]
- The article concludes with GAR's potential uses in training multimodal understanding models and in improving adherence to complex text instructions [80][81]
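The RoI-Aligned Feature Replay idea mentioned under "Model Design and Mechanism" can be read as keeping a coarse set of global tokens for context while re-extracting ("replaying") a high-resolution crop of the prompted region and appending it as extra tokens. The snippet below is a minimal PyTorch/torchvision sketch under that reading; the box coordinates, feature dimensions, and pooling sizes are assumptions, not GAR's released implementation.

```python
# Minimal sketch: coarse global tokens for context + a high-fidelity RoIAlign
# crop of the user-prompted region, appended as extra tokens. Sizes and the box
# are illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

dim, H, W = 256, 48, 48
feat_map = torch.randn(1, dim, H, W)        # high-resolution vision-encoder features

# Global branch: downsample so the full image stays cheap to attend over.
global_feats = F.adaptive_avg_pool2d(feat_map, (12, 12))       # (1, dim, 12, 12)
global_tokens = global_feats.flatten(2).transpose(1, 2)        # (1, 144, dim)

# Region branch: user-specified box in feature-map coordinates,
# given as (batch_index, x1, y1, x2, y2).
box = torch.tensor([[0.0, 10.0, 5.0, 30.0, 25.0]])
region_feats = roi_align(feat_map, box, output_size=(7, 7),
                         spatial_scale=1.0, aligned=True)      # (1, dim, 7, 7)
region_tokens = region_feats.flatten(2).transpose(1, 2)        # (1, 49, dim)

# Concatenate: coarse global context + high-fidelity replayed region tokens.
visual_tokens = torch.cat([global_tokens, region_tokens], dim=1)
print(visual_tokens.shape)                                     # (1, 144 + 49, 256)
```

Because the replayed tokens are cropped from the original high-resolution feature map rather than from the downsampled global view, the prompted region keeps its fine detail without forcing the model to process the whole image at full resolution.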