CUHK and Meituan Open-Source a "Visual Reasoning Generalist": Ten Classes of Image and Video Tasks Covered in One Model
36Kr · 2025-12-12 07:17
Core Insights
- OneThinker, a unified multimodal visual reasoning model developed by the MMLab of The Chinese University of Hong Kong and Meituan, demonstrates strong performance across 31 mainstream visual tasks, showcasing its generalization capabilities in both image and video modalities [1][2][22]

Group 1: Model Development
- OneThinker addresses the limitations of traditional RL models, which typically handle a single modality or task and lack inter-task and inter-modal connections, hindering generalization [4][6]
- The model is designed to unify understanding and reasoning across different modalities and tasks, overcoming the knowledge isolation and limited transfer of specialized models [5][6]

Group 2: Data Construction
- A comprehensive dataset was constructed for OneThinker, including OneThinker-600k for reinforcement learning training and OneThinker-SFT-340k for supervised fine-tuning, covering ten core visual tasks [10]
- The dataset aims to resolve insufficient data coverage and task fragmentation, enabling the model to establish unified reasoning capabilities across spatial and temporal dimensions [10]

Group 3: Training Methodology
- OneThinker employs a novel EMA-GRPO (Exponential Moving Average Group Relative Policy Optimization) algorithm to enhance training stability and convergence speed in multi-task, multi-modal scenarios [12]
- This approach addresses training imbalance by standardizing reward structures across tasks, improving the model's overall performance [12]

Group 4: Experimental Results
- OneThinker achieved impressive results across benchmarks, including a 93.7% score in spatial localization tasks and high performance in image segmentation and tracking tasks [17][20][21]
- The model also demonstrated zero-shot capabilities, adapting to unseen tasks such as point tracking and image quality assessment, indicating robust generalization ability [22]
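The reward standardization described under Training Methodology can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: it assumes EMA-GRPO keeps a per-task exponential moving average of reward mean and variance and uses those statistics to standardize each task's rewards, so that tasks with very different reward scales contribute comparably to the policy update. The class name, `decay` parameter, and per-task dictionary layout are all assumptions for illustration.

```python
# Illustrative sketch of per-task reward standardization with an
# exponential moving average (EMA), in the spirit of EMA-GRPO as
# summarized above. Names and details are assumptions, not the
# paper's actual implementation.

class EMARewardNormalizer:
    """Tracks a per-task EMA of reward mean and variance, then
    standardizes rewards against those running statistics so that
    tasks with different reward scales are balanced during training."""

    def __init__(self, decay: float = 0.99, eps: float = 1e-6):
        self.decay = decay
        self.eps = eps
        self.mean = {}  # task name -> EMA of reward mean
        self.var = {}   # task name -> EMA of reward variance

    def update(self, task: str, rewards: list[float]) -> list[float]:
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if task not in self.mean:
            # First batch for this task initializes the running stats.
            self.mean[task], self.var[task] = batch_mean, batch_var
        else:
            d = self.decay
            self.mean[task] = d * self.mean[task] + (1 - d) * batch_mean
            self.var[task] = d * self.var[task] + (1 - d) * batch_var
        std = (self.var[task] + self.eps) ** 0.5
        # Standardized rewards serve as scale-balanced advantages.
        return [(r - self.mean[task]) / std for r in rewards]
```

Because the statistics are tracked per task, a segmentation reward of 0.9 and a tracking reward of 0.9 can yield different advantages, preventing one task's reward scale from dominating the multi-task gradient.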
CUHK and Meituan Open-Source a "Visual Reasoning Generalist"! Ten Classes of Image and Video Tasks Covered in One Model
量子位 (QbitAI) · 2025-12-12 01:00
Core Insights
- OneThinker is a unified multimodal visual reasoning model developed by the MMLab of The Chinese University of Hong Kong and Meituan, capable of handling ten core visual tasks across both image and video modalities [1][2][8].

Group 1: Model Capabilities
- OneThinker has demonstrated impressive performance across 31 mainstream visual tasks, showcasing its ability to generalize and make reasonable inferences on previously unseen tasks [2][28].
- The model addresses the limitations of traditional RL models, which typically handle a single modality or task, by enabling unified reasoning across different tasks and modalities [4][6][8].

Group 2: Data and Training Methodology
- To build a model with general visual reasoning capabilities, the research team constructed a comprehensive dataset, OneThinker-600k, which includes image and video modalities and covers ten core visual tasks [14][15].
- The training methodology incorporates a new algorithm, EMA-GRPO, which enhances training stability and convergence speed by addressing reward-structure imbalances across different tasks [19][20].

Group 3: Experimental Results
- OneThinker achieved notable results, such as 70.6% on the MMMU image question-answering benchmark and 66.2% on video understanding [22][25].
- The model also excelled at tracking, achieving an AO score of 73.0 on the GOT-10k benchmark, indicating robust performance in perception-related tasks [25][27].

Group 4: Knowledge Transfer and Generalization
- OneThinker exhibits effective knowledge transfer and sharing between tasks, allowing mutual enhancement across different tasks [27][28].
- The model demonstrates zero-shot capabilities, adapting to new tasks like point tracking and image quality assessment, highlighting its strong generalization ability [28].