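Surpassing O4-mini, Multimodal Large Models Finally Learn to Look Back and "See": The Institute of Automation, Chinese Academy of Sciences Proposes the GThinker Model
机器之心 · 2025-07-19 03:13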
Core Viewpoint
- The article discusses the limitations of existing multimodal large models in flexible visual interpretation and introduces GThinker, a new model designed to strengthen multimodal reasoning through a novel "Cue-Guided Rethinking" approach [1][3][10].

Group 1: Limitations of Existing Models
- Despite recent advances, current multimodal models struggle in general scenarios that require flexible visual interpretation. They often rely on knowledge-based reasoning without deeply verifying visual cues [1][8].
- Existing methods, including structured CoT and reinforcement learning, show significant limitations, particularly in correcting misinterpreted visual cues during reasoning [8][10].

Group 2: Introduction of GThinker
- GThinker was developed by researchers from the Institute of Automation, Chinese Academy of Sciences, with the aim of achieving universal multimodal reasoning [2].
- Its core innovation is the "Cue-Guided Rethinking" mode, which lets the model actively verify and correct its visual understanding during reasoning [3][10].

Group 3: Training Methodology
- GThinker uses a two-stage training process to instill the rethinking ability. It starts with supervised fine-tuning on a purpose-built dataset of 7,000 high-quality samples, which cold-starts the model's rethinking capability [20][21].
- In reinforcement learning, the model uses a mixed reward mechanism to encourage active exploration across diverse tasks, strengthening its reasoning [23][24].

Group 4: Performance Results
- GThinker performs strongly on the challenging M³CoT comprehensive reasoning benchmark, surpassing the recent O4-mini model, and achieves state-of-the-art results on a range of mathematical and knowledge reasoning tasks [4][26].
- Across multiple test scenarios, GThinker matched or outperformed existing advanced models, indicating that it learned the rethinking ability effectively without becoming over-specialized [28][30].
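
The summary does not say how the mixed reward mechanism mentioned under Training Methodology is built. As a rough illustration only, the sketch below shows one way a reward that spans diverse tasks could be dispatched by answer type. The task types (`choice`, `numeric`, `text`), the function names, and the scoring rules are all hypothetical and are not taken from the GThinker work.

```python
import math
import re

# Hypothetical sketch: every name and scoring rule here is an assumption
# made for illustration, not the reward design used by GThinker.


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())


def choice_reward(prediction: str, target: str) -> float:
    """Multiple choice: compare the last standalone capital letter A-E to the target."""
    letters = re.findall(r"\b([A-E])\b", prediction)
    return 1.0 if letters and letters[-1] == target.strip().upper() else 0.0


def numeric_reward(prediction: str, target: str, rel_tol: float = 1e-6) -> float:
    """Numeric answers: compare the last number in the prediction within a tolerance."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", prediction)
    if not numbers:
        return 0.0
    return 1.0 if math.isclose(float(numbers[-1]), float(target), rel_tol=rel_tol) else 0.0


def text_reward(prediction: str, target: str) -> float:
    """Free-form answers: exact match after normalization."""
    return 1.0 if normalize(prediction) == normalize(target) else 0.0


# Map each (assumed) task type to its scoring function.
REWARD_FUNCTIONS = {
    "choice": choice_reward,
    "numeric": numeric_reward,
    "text": text_reward,
}


def mixed_reward(task_type: str, prediction: str, target: str) -> float:
    """Route a sample to the scoring function registered for its task type."""
    if task_type not in REWARD_FUNCTIONS:
        raise ValueError(f"unknown task type: {task_type}")
    return REWARD_FUNCTIONS[task_type](prediction, target)
```

For example, `mixed_reward("choice", "The answer is B", "B")` scores the sample with the multiple-choice rule, while `mixed_reward("numeric", "x = 3.50", "3.5")` uses the numeric rule. Keeping each rule in its own function means a new task type can be added without touching the others.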