Core Insights
- The article introduces UniPixel, a unified pixel-level multimodal large model developed by a research team from Hong Kong Polytechnic University and Tencent ARC Lab that performs object referring, segmentation, and reasoning within a single framework [1][3][4].

Model Capabilities
- UniPixel handles three major tasks: target referring, pixel-level segmentation, and region-level reasoning, demonstrating flexibility, precision, and scalability [3][4].
- The model has been accepted at NeurIPS 2025, and its code, data, and demo have been open-sourced [3].

Technical Innovations
- UniPixel reframes visual reasoning by enabling precise perception of specific regions or targets within images and videos, addressing a key limitation of traditional visual question-answering systems [4][6].
- The architecture builds on the Qwen2.5-VL model and supports multiple input types and visual prompts, producing both natural-language responses and spatio-temporal masks [6][8].

Key Modules
- The model incorporates three critical modules: a prompt encoder for visual prompts, an object memory bank that stores user-specified targets, and a mask decoder that generates precise spatio-temporal masks (a toy wiring sketch follows this digest) [8][12].
- UniPixel extends the language model's vocabulary with special tokens to integrate visual prompts and memory retrieval into the generation process (see the tokenizer sketch below) [9].

Performance Evaluation
- Extensive experiments on ten public benchmark datasets show superior performance across nine vision-language understanding tasks, particularly segmentation, where UniPixel outperforms existing models [19][20].
- On the ReVOS reasoning-segmentation benchmark, UniPixel achieves a J&F score of 62.1, surpassing all other models and indicating strong associative modeling between complex text prompts and pixel-level mask generation (a simplified J&F implementation is sketched below) [20].

Training Data and Methodology
- The training dataset comprises approximately 1 million samples spanning text, images, and video, which improves the model's adaptability across task settings [17].
- The training strategy is modular and phased, allowing the visual encoder and language model to be trained jointly without overfitting to any single task (a freeze/unfreeze skeleton is sketched below) [16].

Future Implications
- UniPixel marks a milestone in multimodal AI's shift from modality alignment to fine-grained understanding, pointing toward intelligent agents capable of precise focus and natural interaction [34].
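To make the Key Modules description concrete, here is a minimal toy sketch of how a prompt encoder, object memory bank, and mask decoder could be wired together. All class names, tensor shapes, and the scoring scheme are illustrative assumptions for intuition, not UniPixel's actual implementation.

```python
import torch
import torch.nn as nn


class PromptEncoder(nn.Module):
    """Projects a visual prompt (e.g., box coordinates) into the LLM hidden space."""

    def __init__(self, prompt_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(prompt_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, prompt: torch.Tensor) -> torch.Tensor:
        return self.proj(prompt)


class ObjectMemoryBank:
    """Stores embeddings of user-specified targets, keyed by object id."""

    def __init__(self):
        self._store: dict[int, torch.Tensor] = {}

    def write(self, obj_id: int, embedding: torch.Tensor) -> None:
        self._store[obj_id] = embedding

    def read(self, obj_id: int) -> torch.Tensor:
        return self._store[obj_id]


class MaskDecoder(nn.Module):
    """Scores an object embedding against per-frame features to form mask logits."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, obj_emb: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # obj_emb: (D,) | frame_feats: (T, H*W, D) -> logits: (T, H*W)
        query = self.query_proj(obj_emb)
        return torch.einsum("d,tnd->tn", query, frame_feats)


# Toy forward pass: one box prompt, 4 frames of 16x16 feature maps.
hidden = 256
encoder, memory, decoder = PromptEncoder(4, hidden), ObjectMemoryBank(), MaskDecoder(hidden)
box = torch.tensor([0.2, 0.3, 0.6, 0.8])            # normalized (x1, y1, x2, y2)
memory.write(0, encoder(box))                        # remember target 0
frame_feats = torch.randn(4, 16 * 16, hidden)        # stand-in for video features
mask_logits = decoder(memory.read(0), frame_feats)   # (4, 256); reshape to (4, 16, 16)
```

The point of the memory bank in this sketch is that a target is encoded once and can then be re-queried across frames or follow-up questions without re-encoding the prompt.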
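The special-token mechanism described under Key Modules can be illustrated with a HuggingFace-style vocabulary extension. The token names below are placeholders of my choosing, not the tokens UniPixel actually defines.

```python
# Sketch of vocabulary extension, assuming a HuggingFace-style stack.
# The token names are illustrative placeholders, not UniPixel's tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<REF>", "<SEG>", "<MEM>"]}
)
# Grow the embedding matrix so each new token id gets a trainable row.
model.resize_token_embeddings(len(tokenizer))

# During generation, emitting e.g. <SEG> could signal the mask decoder to
# fire, while <MEM> could trigger a lookup in the object memory bank.
```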
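For the J&F score cited in the Performance Evaluation section: J is region similarity (mask IoU) and F is a boundary accuracy measure; the benchmark reports their mean. Below is a simplified re-implementation for intuition; the official DAVIS-style evaluation uses more careful boundary matching, so treat this as an approximation.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion


def region_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: region similarity, the IoU of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union


def _boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels: mask minus its one-step erosion."""
    return mask & ~binary_erosion(mask)


def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """F: boundary precision/recall within a small pixel tolerance band."""
    pb, gb = _boundary(pred), _boundary(gt)
    precision = (pb & binary_dilation(gb, iterations=tol)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, iterations=tol)).sum() / max(gb.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """J&F: the mean of region similarity and boundary accuracy."""
    return (region_j(pred, gt) + boundary_f(pred, gt)) / 2
```

Under this reading, UniPixel's 62.1 on ReVOS means its predicted masks average roughly 62% on the combined region-overlap and boundary-agreement scale.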
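Finally, the "modular and phased" training strategy can be sketched as a freeze/unfreeze schedule. The module names and the two-phase recipe below are assumptions for illustration, not UniPixel's published training configuration.

```python
# Illustrative phased-training skeleton; attribute names (vision_encoder,
# language_model, mask_decoder) are assumed, not UniPixel's actual layout.
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    for param in module.parameters():
        param.requires_grad = flag


def configure_phase(model: nn.Module, phase: int) -> None:
    if phase == 1:
        # Warm up the new pixel-level modules; keep pretrained parts frozen.
        set_trainable(model.vision_encoder, False)
        set_trainable(model.language_model, False)
        set_trainable(model.mask_decoder, True)
    elif phase == 2:
        # Jointly fine-tune the encoder and LLM alongside the decoder.
        set_trainable(model.vision_encoder, True)
        set_trainable(model.language_model, True)
        set_trainable(model.mask_decoder, True)
```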
Multimodal large model achieves pixel-level reasoning for the first time: 3B parameters surpass traditional 72B models, accepted at NeurIPS 2025
36Kr · 2025-10-16 07:39