Multimodal large model achieves pixel-level reasoning for the first time! A 3B-parameter model surpasses traditional 72B models, accepted to NeurIPS 2025
量子位·2025-10-16 06:11

Core Insights
- The article introduces UniPixel, a unified pixel-level multimodal model developed by a research team from The Hong Kong Polytechnic University and Tencent ARC Lab, aimed at enhancing the visual reasoning capabilities of AI systems [2][4].

Group 1: Model Overview
- UniPixel handles three major tasks, object referring, pixel-level segmentation, and reasoning, within a single model, combining flexibility, precision, and scalability [4][8].
- The model has been accepted at NeurIPS 2025, and its code, data, and demo are fully open-sourced [5].

Group 2: Technical Innovations
- UniPixel redefines visual reasoning by addressing a key limitation of traditional visual question-answering systems: their inability to precisely perceive specific regions or targets within an image [8][9].
- The model introduces an "Object Memory Bank" and supports three types of visual prompts (point, box, and mask), enabling an end-to-end "perception-memory-reasoning" workflow [9][12]; a hedged code sketch of this pipeline appears after the group summaries below.

Group 3: Architecture and Functionality
- UniPixel is built on the Qwen2.5-VL backbone, allowing it to process images, videos, and text prompts, and to generate natural-language responses together with spatio-temporal masks [12][14].
- Key components include a Prompt Encoder for unified encoding of visual prompts, an Object Memory Bank for storing user-specified targets, and a Mask Decoder for generating precise spatio-temporal masks [19][21].

Group 4: Training and Evaluation
- Training followed a modular, phased strategy over roughly 1 million samples drawn from a range of datasets, improving the model's adaptability to different tasks [28][29]; a hypothetical sketch of such a phase schedule is given below.
- Extensive experiments on 10 public benchmark datasets covering 9 major vision-language understanding tasks show superior performance on complex reasoning and segmentation [31][33].

Group 5: Performance Metrics
- On the ReVOS reasoning-segmentation benchmark, UniPixel-3B scored 62.1 J&F, surpassing all existing models and demonstrating a strong ability to associate complex text prompts with pixel-level mask generation [33]; the J&F metric itself is sketched in code below.
- The model also leads on MeViS, Ref-YouTube-VOS, and RefCOCO, showing state-of-the-art results across a range of visual understanding tasks [33][34].

Group 6: Future Implications
- UniPixel marks a significant milestone for multimodal AI, moving the field from "modal alignment" toward "fine-grained understanding" by unifying object referring and segmentation with language reasoning [47][48].
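To make the "perception-memory-reasoning" flow from Groups 2 and 3 concrete, here is a minimal Python sketch. Every name in it (VisualPrompt, PromptEncoder, ObjectMemoryBank, MaskDecoder, perceive_memorize_reason) is a hypothetical stand-in: the article does not expose UniPixel's actual API, and in the real model these modules are learned neural components wired into the Qwen2.5-VL backbone, not toy Python classes.

```python
from dataclasses import dataclass
from typing import Literal
import numpy as np

@dataclass
class VisualPrompt:
    """A user-supplied visual prompt; UniPixel supports three kinds (hypothetical structure)."""
    kind: Literal["point", "box", "mask"]
    data: np.ndarray  # (x, y), (x1, y1, x2, y2), or an HxW binary mask

class PromptEncoder:
    """Encodes heterogeneous visual prompts into one shared embedding space (stand-in)."""
    def __init__(self, dim: int = 256):
        self.dim = dim

    def encode(self, prompt: VisualPrompt) -> np.ndarray:
        # Stand-in for a learned encoder: deterministically hash the prompt to a vector.
        seed = abs(hash((prompt.kind, prompt.data.tobytes()))) % 2**32
        return np.random.default_rng(seed).standard_normal(self.dim)

class ObjectMemoryBank:
    """Stores embeddings of user-specified targets so later reasoning can refer back to them."""
    def __init__(self):
        self._bank: dict[int, np.ndarray] = {}

    def insert(self, obj_id: int, embedding: np.ndarray) -> None:
        self._bank[obj_id] = embedding

    def retrieve(self, obj_id: int) -> np.ndarray:
        return self._bank[obj_id]

class MaskDecoder:
    """Turns a target embedding plus frame features into a per-frame binary mask (stub)."""
    def decode(self, embedding: np.ndarray, frame: np.ndarray) -> np.ndarray:
        score = frame.mean(axis=-1) + embedding[:1]  # placeholder scoring, not a real decoder
        return (score > score.mean()).astype(np.uint8)

def perceive_memorize_reason(frames, prompt, obj_id=0):
    """Perception -> memory -> reasoning: encode the prompt, store the target,
    then decode one mask per frame, giving a spatio-temporal mask for that target."""
    encoder, memory, decoder = PromptEncoder(), ObjectMemoryBank(), MaskDecoder()
    memory.insert(obj_id, encoder.encode(prompt))       # perception + memory
    target = memory.retrieve(obj_id)                    # the LLM would reason over this
    return [decoder.decode(target, f) for f in frames]  # pixel-level output

masks = perceive_memorize_reason(
    frames=[np.random.rand(4, 4, 3) for _ in range(2)],
    prompt=VisualPrompt(kind="point", data=np.array([1.0, 2.0])),
)
print([m.shape for m in masks])
```

The design point mirrored here is that all three prompt types map into a single embedding space, so the memory bank and mask decoder never need prompt-specific logic.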
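The article notes only that training was modular and phased over roughly 1 million samples; it does not name the phases. The schedule below is one plausible decomposition, purely an assumption for illustrating what a modular, phased strategy can look like, not the paper's actual recipe:

```python
# Hypothetical illustration of a modular, phased training schedule.
# Phase names, module freezing, and ordering are assumptions for illustration.
TRAINING_PHASES = [
    {   # Phase 1: align the new prompt pathway with the frozen backbone.
        "name": "prompt_alignment",
        "trainable": ["prompt_encoder"],
        "frozen": ["qwen2.5-vl_backbone", "mask_decoder"],
    },
    {   # Phase 2: learn pixel-level mask generation.
        "name": "segmentation_pretraining",
        "trainable": ["prompt_encoder", "mask_decoder"],
        "frozen": ["qwen2.5-vl_backbone"],
    },
    {   # Phase 3: joint fine-tuning on mixed referring/segmentation/reasoning data.
        "name": "joint_finetuning",
        "trainable": ["prompt_encoder", "mask_decoder", "qwen2.5-vl_backbone"],
        "frozen": [],
    },
]

for phase in TRAINING_PHASES:
    print(f"{phase['name']}: train {phase['trainable']}, freeze {phase['frozen']}")
```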
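For readers new to the 62.1 J&F figure reported on ReVOS: J&F is the standard video object segmentation metric, averaging region similarity J (mask IoU) and contour accuracy F (a boundary F-measure). The sketch below is a simplified per-frame version; the official DAVIS-style protocol matches boundaries with a small pixel tolerance rather than requiring exact overlap.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: intersection-over-union of predicted and ground-truth binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels: the mask minus its morphological erosion."""
    return mask & ~binary_erosion(mask)

def contour_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """F: boundary F-measure, simplified to exact pixel matches."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    hits = np.logical_and(bp, bg).sum()
    precision = hits / max(bp.sum(), 1)
    recall = hits / max(bg.sum(), 1)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """J&F: the mean of region similarity and contour accuracy."""
    return (region_similarity(pred, gt) + contour_accuracy(pred, gt)) / 2

# Toy example: a predicted square shifted one pixel from the ground truth.
gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 3:7] = True
print(f"J&F = {j_and_f(pred, gt):.3f}")
```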