CUHK and Meituan jointly open-source a "visual reasoning generalist" that handles 10 classes of image and video tasks in one model

Core Insights
- OneThinker, a unified multimodal visual reasoning model developed by MMLab at The Chinese University of Hong Kong and Meituan, delivers strong performance across 31 mainstream visual tasks, demonstrating generalization in both image and video modalities [1][2][22]

Group 1: Model Development
- OneThinker targets a limitation of traditional RL-trained models, which typically handle a single modality or task; the resulting lack of inter-task and inter-modal connections hinders generalization [4][6]
- The model is designed to unify understanding and reasoning across modalities and tasks, overcoming the knowledge isolation and limited transfer of specialized models [5][6]

Group 2: Data Construction
- A comprehensive dataset was built for OneThinker, comprising OneThinker-600k for reinforcement-learning training and OneThinker-SFT-340k for supervised fine-tuning, together covering ten core visual tasks [10]
- The dataset is meant to resolve insufficient data coverage and task fragmentation, enabling the model to establish unified reasoning across spatial and temporal dimensions [10]

Group 3: Training Methodology
- OneThinker employs a novel EMA-GRPO (Exponential Moving Average Group Relative Policy Optimization) algorithm to improve training stability and convergence speed in multi-task, multi-modal settings [12]
- The approach addresses training imbalance by standardizing reward statistics across tasks, improving overall performance [12]; a sketch of the idea follows this summary

Group 4: Experimental Results
- OneThinker achieved strong results across benchmarks, including a 93.7% score on spatial localization tasks and high performance on image segmentation and tracking [17][20][21]
- The model also demonstrated zero-shot capability, adapting to unseen tasks such as point tracking and image quality assessment, indicating robust generalization [22]
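The article does not spell out EMA-GRPO's update rule, so the following is a minimal sketch of the stated idea: a GRPO-style group-relative advantage whose normalizer is a per-task exponential moving average of reward statistics rather than the sampled group's own standard deviation. The class name `EMARewardNormalizer`, the `decay` value, and the task tag are hypothetical illustrations, not taken from the paper.

```python
import numpy as np
from collections import defaultdict


class EMARewardNormalizer:
    """Hypothetical per-task reward standardizer in the spirit of EMA-GRPO.

    Keeps an exponential moving average (EMA) of each task's reward
    statistics so that advantages from tasks with very different reward
    scales stay comparable during multi-task RL training.
    """

    def __init__(self, decay: float = 0.99, eps: float = 1e-6):
        self.decay = decay
        self.eps = eps
        self.ema_var = defaultdict(lambda: 1.0)  # per-task EMA of reward variance

    def update(self, task: str, rewards: np.ndarray) -> None:
        # Blend the current group's reward variance into the task's EMA.
        d = self.decay
        self.ema_var[task] = d * self.ema_var[task] + (1 - d) * rewards.var()

    def advantages(self, task: str, rewards: np.ndarray) -> np.ndarray:
        # GRPO-style group-relative advantage: center within the sampled
        # group, but scale by the task's smoothed std instead of the group's
        # own std, so a low-variance group does not blow up the update.
        centered = rewards - rewards.mean()
        return centered / (np.sqrt(self.ema_var[task]) + self.eps)


# Usage: one group of rollouts for the same prompt, tagged with its task.
norm = EMARewardNormalizer(decay=0.99)
rewards = np.array([0.0, 1.0, 1.0, 0.0])  # e.g. binary correctness rewards
norm.update("segmentation", rewards)
print(norm.advantages("segmentation", rewards))
```

Under this reading, a task whose sampled group happens to have near-zero reward variance still yields finite, well-scaled advantages because the normalizer is smoothed over the task's history, which is one plausible interpretation of the stability and balance claims above.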