PixelRefer: Moving AI from "seeing the big picture" to "understanding every object"
机器之心 · 2025-11-10 23:47
Core Insights
- The article discusses the limitations of current multimodal large language models (MLLMs) in achieving the fine-grained, object-level understanding required by real-world applications such as autonomous driving and medical imaging, motivating a more detailed visual-understanding framework [2][38]
- PixelRefer is introduced as a unified spatio-temporal understanding framework capable of fine-grained visual referring and reasoning at arbitrary granularity, outperforming existing models on several benchmarks [2][38]

Model Overview
- PixelRefer feeds global visual tokens, pixel-level region tokens, and text tokens jointly into a large language model (LLM), preserving both scene context and object-level reasoning capability (a token-fusion sketch appears after this summary) [16][22]
- The model's lightweight variant, PixelRefer-Lite, achieves a 4x increase in inference speed and halves memory usage compared with existing models such as DAM-3B (a cost sketch follows the fusion sketch below) [2][33]

Methodology
- The authors propose two frameworks for pixel-level fine-grained understanding, a Vision-Object Framework and an Object-Only Framework, emphasizing the importance of high-quality pixel-level object representations [15][22]
- A Scale-Adaptive Object Tokenizer (SAOT) is introduced to generate precise yet compact object representations, addressing the differing detail demands of small and large objects (a SAOT sketch closes this summary) [17][16]

Performance Metrics
- PixelRefer achieves state-of-the-art (SOTA) performance across image-understanding benchmarks, including PACO and DLC-Bench, with notable advantages in reasoning scenarios [28][30]
- On video pixel-level understanding benchmarks, PixelRefer likewise leads, particularly in video captioning and question-answering tasks [29][31]

Applications and Future Directions
- The advances presented by PixelRefer mark a shift toward understanding the dynamic details of the world, with potential applications in autonomous driving, medical imaging, intelligent video editing, and multimodal dialogue systems [38][40]
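To make the token-fusion idea in Model Overview concrete, here is a minimal PyTorch sketch of how global scene tokens, per-object region tokens, and text tokens could be projected into the LLM embedding space and concatenated into one input sequence. The module names, dimensions, and two-projector layout are illustrative assumptions; the article does not specify PixelRefer's actual fusion code.

```python
import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    """Hypothetical sketch of PixelRefer-style token fusion: global visual
    tokens, per-object region tokens, and text tokens are mapped into the
    LLM embedding space and concatenated into a single sequence."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, llm_dim)  # global visual tokens -> LLM space
        self.obj_proj = nn.Linear(vis_dim, llm_dim)  # object region tokens -> LLM space

    def forward(self, global_tokens, object_tokens, text_embeds):
        # global_tokens: (B, N_vis, vis_dim)  patch features of the full frame
        # object_tokens: (B, N_obj, vis_dim)  pooled features of referred objects
        # text_embeds:   (B, N_txt, llm_dim)  already-embedded prompt tokens
        fused = torch.cat(
            [self.vis_proj(global_tokens),
             self.obj_proj(object_tokens),
             text_embeds],
            dim=1,
        )
        return fused  # (B, N_vis + N_obj + N_txt, llm_dim), fed to the LLM

# An Object-Only variant (in the spirit of PixelRefer-Lite) would drop
# `global_tokens` from the concatenation, shrinking the sequence length.

fusion = TokenFusion()
seq = fusion(torch.randn(2, 576, 1024),   # assumed global patch tokens
             torch.randn(2, 8, 1024),     # assumed per-object tokens
             torch.randn(2, 64, 4096))    # assumed text embeddings
print(seq.shape)  # torch.Size([2, 648, 4096])
```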
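A back-of-the-envelope view of why the Object-Only framework is cheaper: self-attention cost grows roughly quadratically with sequence length, so dropping the global visual tokens shrinks the dominant term. The token counts below are assumed purely for illustration; the reported end-to-end gain is about 4x, smaller than this raw attention ratio, because vision encoding and costs linear in sequence length also contribute.

```python
# Rough cost intuition (illustrative numbers, not from the paper):
# per-layer self-attention ops scale ~ L^2 with sequence length L.
n_vis, n_obj, n_txt = 576, 8, 64              # assumed token counts
vision_object = (n_vis + n_obj + n_txt) ** 2  # Vision-Object: L = 648
object_only = (n_obj + n_txt) ** 2            # Object-Only:   L = 72
print(vision_object / object_only)            # ~81x fewer attention ops per layer
```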
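The SAOT bullet in Methodology can be pictured as mask-guided, size-aware pooling. The sketch below crops an object's bounding box from a feature map and pools it onto a grid whose resolution adapts to the object's area; the function name, the token-budget heuristic, and the pooling choice are assumptions, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def scale_adaptive_object_tokens(feat, mask, max_tokens: int = 16):
    """Illustrative SAOT-like tokenizer: pool one object's masked features
    to a token budget that adapts to the object's size.

    feat: (C, H, W) visual feature map; mask: (H, W) binary object mask.
    Returns (n_tokens, C) object tokens.
    """
    ys, xs = mask.nonzero(as_tuple=True)
    if ys.numel() == 0:
        return feat.new_zeros(1, feat.shape[0])  # empty mask -> one zero token
    # Crop the object's bounding box and zero out background inside it,
    # so small objects are not diluted by surrounding context.
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    roi = feat[:, y0:y1, x0:x1] * mask[y0:y1, x0:x1].to(feat.dtype)
    # Adapt the grid to object area: never exceed the token budget, and
    # never oversample a tiny object's few feature cells.
    area = int(mask.sum())
    side = max(1, min(int(max_tokens ** 0.5), int(area ** 0.5)))
    # Pool the masked ROI onto the adaptive grid; each cell becomes a token.
    grid = F.adaptive_avg_pool2d(roi.unsqueeze(0), (side, side)).squeeze(0)
    return grid.flatten(1).transpose(0, 1)  # (side * side, C)

feat = torch.randn(1024, 24, 24)                 # assumed ViT feature map
mask = torch.zeros(24, 24); mask[3:7, 5:9] = 1   # small 4x4 object
print(scale_adaptive_object_tokens(feat, mask).shape)  # torch.Size([16, 1024])
```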