UniPixel
Multimodal large model achieves pixel-level reasoning for the first time; 3B parameters outperform a traditional 72B model; accepted at NeurIPS 2025
36Kr · 2025-10-16 07:39
Core Insights
- The article introduces UniPixel, a unified pixel-level multimodal large model developed by a research team from Hong Kong Polytechnic University and Tencent ARC Lab, which performs referring, segmentation, and reasoning tasks effectively [1][3][4].

Model Capabilities
- UniPixel handles three major tasks: target referring, pixel-level segmentation, and region reasoning, combining flexibility, precision, and scalability [3][4].
- The model has been accepted for presentation at NeurIPS 2025, with its code, data, and demo open-sourced [3].

Technical Innovations
- UniPixel redefines visual reasoning by enabling precise perception of specific regions or targets within images or videos, addressing limitations of traditional visual question-answering systems [4][6].
- The architecture is built on the Qwen2.5-VL model and supports various input types and visual prompts, producing both natural-language responses and spatio-temporal masks [6][8].

Key Modules
- The model incorporates three critical modules: a prompt encoder for visual prompts, an object memory bank for storing user-specified targets, and a mask decoder for generating precise spatio-temporal masks [8][12] (see the flow sketch after this summary).
- UniPixel extends its language model's vocabulary with special tokens that integrate visual prompts and memory retrieval into generation [9] (see the tokenizer sketch after this summary).

Performance Evaluation
- Extensive experiments on ten public benchmark datasets demonstrate UniPixel's superior performance across nine visual-language understanding tasks, particularly segmentation, where it outperforms existing models [19][20].
- On the ReVOS reasoning-segmentation benchmark, UniPixel achieves a J&F score of 62.1, surpassing all other models and indicating strong associative modeling between complex text prompts and pixel-level mask generation [20].

Training Data and Methodology
- The training dataset comprises approximately 1 million samples spanning text, images, and videos, which improves the model's adaptability across task settings [17].
- The training strategy is modular and phased, allowing the visual encoder and language model to be trained jointly without overfitting to specific tasks [16].

Future Implications
- UniPixel marks a significant milestone in multimodal AI, shifting the field from modality alignment toward fine-grained understanding and potentially enabling intelligent agents capable of precise focus and natural interaction [34].
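To make the module list above concrete, here is a minimal Python sketch of how the three parts could be wired together. It is an illustration under assumptions, not the released implementation: every name in it (`VisualPrompt`, `ObjectMemoryBank`, `prompt_encoder`, `mask_decoder`, `unipixel_step`) is hypothetical, and the backbone call is stubbed out.

```python
"""Hypothetical sketch of UniPixel's three-module flow (not the released code)."""
from dataclasses import dataclass, field
from typing import Any

@dataclass
class VisualPrompt:
    kind: str   # "point", "box", or "mask" (assumed prompt types)
    data: Any   # coordinates or a mask payload

@dataclass
class ObjectMemoryBank:
    """Object memory bank: keeps user-specified targets across dialogue turns."""
    _store: dict = field(default_factory=dict)

    def store(self, oid: str, feats: Any) -> None:
        self._store[oid] = feats

    def retrieve(self, oid: str) -> Any:
        return self._store[oid]

def prompt_encoder(prompt: VisualPrompt) -> list:
    """Stub: embed a point/box/mask prompt into features the LLM can attend to."""
    return [hash((prompt.kind, str(prompt.data))) % 100 / 100.0]

def mask_decoder(query: list, num_frames: int) -> list:
    """Stub: turn one mask query into a placeholder binary mask per frame."""
    return [[0] for _ in range(num_frames)]

def unipixel_step(frames: list, text: str, prompts: list, memory: ObjectMemoryBank):
    # 1. Encode each visual prompt into prompt tokens.
    tokens = [prompt_encoder(p) for p in prompts]
    # 2. The backbone LMM (Qwen2.5-VL in the article) would consume frames, text,
    #    and prompt tokens, emitting special tokens that mark targets to memorize
    #    and masks to decode; stubbed here as a canned reply plus one query per prompt.
    reply = f"answer about {len(prompts)} referenced target(s): {text}"
    # 3. Store the referenced target so later turns can retrieve it.
    memory.store("obj_1", tokens)
    # 4. Decode a spatio-temporal mask (one mask per frame) for each query.
    masks = [mask_decoder(q, len(frames)) for q in tokens]
    return reply, masks

memory = ObjectMemoryBank()
reply, masks = unipixel_step(frames=[None] * 4, text="What is this person holding?",
                             prompts=[VisualPrompt("point", (120, 88))], memory=memory)
print(reply)
print(len(masks), "mask track(s), each covering", len(masks[0]), "frames")
```

The memory bank is the piece that distinguishes this design from single-shot segmentation models: once a target is stored, later dialogue turns can reason about it without the user re-drawing the prompt.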
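The vocabulary extension mentioned under Key Modules is a standard operation when building on an open LLM; the sketch below shows how such special tokens are typically registered with Hugging Face transformers. The token names `<REF>`, `<MEM>`, `<SEG>` and the checkpoint are placeholders for illustration; the article does not specify UniPixel's actual token set.

```python
# Illustrative sketch: registering special tokens with Hugging Face transformers.
# Token names and the checkpoint are placeholders, not UniPixel's actual choices.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-7B-Instruct"  # example checkpoint, not the paper's backbone
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Hypothetical markers for referring prompts, memory retrieval, and mask triggers.
special = {"additional_special_tokens": ["<REF>", "<MEM>", "<SEG>"]}
tokenizer.add_special_tokens(special)

# Resize the embedding matrix so each new token id gets a trainable embedding row.
model.resize_token_embeddings(len(tokenizer))
```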
Multimodal large model achieves pixel-level reasoning for the first time! 3B parameters outperform a traditional 72B model, accepted at NeurIPS 2025
QbitAI (量子位) · 2025-10-16 06:11
Contributed by the UniPixel team
QbitAI (量子位) | WeChat official account: QbitAI

A multimodal large model achieves pixel-level reasoning for the first time, sweeping all three tasks: referring, segmentation, and reasoning!

AI can now "describe what it sees" with ease, but even GPT-5 and Gemini 2.5 Pro only get the rough picture and struggle with more precise target identification and reasoning.

To address this, a research team from Hong Kong Polytechnic University and Tencent ARC Lab has proposed the first unified pixel-level multimodal large model: UniPixel.

Without further ado, let's take a look at what UniPixel can do: a single UniPixel model handles all three tasks, target referring (Referring), pixel-level segmentation (Segmentation), and region reasoning (Reasoning), combining flexibility, precision, and scalability.

The paper has been accepted at NeurIPS 2025, and the code, data, and demo are fully open-sourced!

More details follow.

UniPixel redefines visual reasoning

Traditional visual question-answering and captioning systems mostly reason over the image or video as a whole, lacking precise perception of "specific regions" or "designated targets" within the picture.

This not only limits their practical use in scenarios such as medical diagnosis, autonomous driving, and human-computer interaction, but also falls short of users' higher-order demands for controllability and interpretability.

Take an everyday task ...