Core Insights
- OneThinker is a unified multimodal visual reasoning model developed by MMLab at The Chinese University of Hong Kong together with Meituan, capable of handling ten core visual tasks across both image and video modalities [1][2][8].

Group 1: Model Capabilities
- OneThinker has demonstrated strong performance across 31 mainstream visual tasks, and can generalize to make reasonable inferences on previously unseen tasks [2][28].
- The model addresses a limitation of traditional RL-trained models, which typically handle a single modality or task, by enabling unified reasoning across different tasks and modalities [4][6][8].

Group 2: Data and Training Methodology
- To build general visual reasoning capability, the research team constructed OneThinker-600k, a comprehensive dataset spanning image and video modalities and covering the ten core visual tasks [14][15].
- Training uses a new algorithm, EMA-GRPO, which improves training stability and convergence speed by correcting reward-structure imbalances across different tasks [19][20].

Group 3: Experimental Results
- OneThinker achieved notable results across tasks, scoring 70.6% on the MMMU image question-answering benchmark and 66.2% on video understanding [22][25].
- The model also excelled at tracking, reaching an AO score of 73.0 on the GOT-10k benchmark, indicating robust performance on perception-related tasks [25][27].

Group 4: Knowledge Transfer and Generalization
- OneThinker transfers and shares knowledge effectively between tasks, allowing mutual enhancement across different tasks [27][28].
- The model shows zero-shot capability, adapting to new tasks such as point tracking and image quality assessment, highlighting its strong generalization ability [28].
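The article says only that EMA-GRPO stabilizes training by addressing reward imbalances across tasks, without giving the algorithm. A plausible reading is that each task's rewards are normalized by exponential-moving-average statistics before computing GRPO-style advantages, so tasks with very different reward scales contribute comparable gradients. The sketch below illustrates that idea; the class name, decay value, and update rule are all assumptions, not the authors' actual method.

```python
# Hypothetical sketch: per-task EMA reward normalization for GRPO-style training.
# All names and formulas here are assumptions; the article does not specify
# EMA-GRPO's actual update rule.
from collections import defaultdict


class EMARewardNormalizer:
    """Tracks an exponential moving average of reward mean/variance per task,
    so advantages from tasks with different reward scales stay comparable."""

    def __init__(self, decay: float = 0.99, eps: float = 1e-6):
        self.decay = decay
        self.eps = eps
        self.mean = defaultdict(float)        # per-task EMA of reward mean
        self.var = defaultdict(lambda: 1.0)   # per-task EMA of reward variance

    def update_and_normalize(self, task: str, rewards: list[float]) -> list[float]:
        # update the task's running mean with this batch
        batch_mean = sum(rewards) / len(rewards)
        self.mean[task] = self.decay * self.mean[task] + (1 - self.decay) * batch_mean
        # update the task's running variance around the EMA mean
        batch_var = sum((r - self.mean[task]) ** 2 for r in rewards) / len(rewards)
        self.var[task] = self.decay * self.var[task] + (1 - self.decay) * batch_var
        std = (self.var[task] + self.eps) ** 0.5
        # GRPO-style advantage: center by the task's running mean, scale by its std
        return [(r - self.mean[task]) / std for r in rewards]


norm = EMARewardNormalizer()
adv = norm.update_and_normalize("tracking", [0.2, 0.8, 0.5])
```

Normalizing by slowly-updated per-task statistics, rather than per-batch statistics alone, would keep one high-variance task from dominating the shared policy update, which matches the stability claim in the article.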
CUHK and Meituan open-source a "visual reasoning generalist": one model covers ten classes of image and video tasks
量子位 (QbitAI) · 2025-12-12 01:00