The first slow-thinking framework built specifically for multi-modal models: beating GPT-o1 by nearly 7 percentage points, reinforcement learning teaches VLMs to "think twice before acting"
量子位 · 2025-06-06 13:45
Core Insights

- The article discusses the limitations of "slow thinking" models such as GPT-o1 and DeepSeek-R1 in multi-modal reasoning scenarios, noting that they perform on par with or worse than "fast thinking" models such as GPT-4o on certain benchmarks [1][2].

Group 1: Challenges in Multi-Modal Reasoning

- The research identifies two main obstacles to developing slow-thinking capabilities in vision-language models (VLMs): "vanishing advantages" and "reflective inertia" [2][3].
- "Vanishing advantages" occurs when all responses to a query receive the same reward; the resulting surge in zero-advantage samples during training leaves the model with little learning signal [3][4].
- "Reflective inertia" in VLMs is attributed to their reliance on visual perception and the scarcity of diverse reflective patterns in pre-training data, which leaves them poorly equipped for deep reasoning [5][6].

Group 2: VL-Rethinker Framework

- To address the shortage of high-quality training data, the research team built the ViRL39K dataset, comprising 38,870 high-quality multi-modal reasoning questions across eight themes [7][8][9].
- The VL-Rethinker framework introduces two key innovations: Selective Sample Replay (SSR) and Forced Rethinking [17].
- SSR dynamically stores and replays high-value training samples to counteract the vanishing-advantages problem and improve training efficiency [19][20].
- Forced Rethinking triggers a second reasoning pass after the model generates an initial response, encouraging diverse reflective behaviors [21][25].

Group 3: Experimental Results

- VL-Rethinker delivered substantial gains on multi-modal reasoning benchmarks, outperforming GPT-o1 on MathVista (80.4% vs. 73.4%) and MathVerse (63.5% vs. 57.0%) [27].
- In multi-disciplinary understanding tests, VL-Rethinker scored 55.9% on MMMU-Pro and 38.5% on EMMA, setting new state-of-the-art results [28].
- Iterative improvements to VL-Rethinker demonstrated the effectiveness of SSR and the potential of slow thinking in multi-modal contexts, with notable gains over the base model Qwen2.5-VL-72B [29].
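The vanishing-advantages problem and the SSR remedy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a GRPO-style group-relative advantage (reward minus the group mean), and the class name `SelectiveSampleReplay` and its weighting scheme are hypothetical stand-ins for whatever the authors actually use.

```python
import random

def group_advantages(rewards):
    """Group-relative advantage: each reward minus the group mean.

    When every response in a group earns the same reward, all
    advantages are exactly zero and the group contributes no
    gradient -- the "vanishing advantages" failure mode.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

class SelectiveSampleReplay:
    """Sketch of an SSR-style buffer: keep only samples whose
    advantage is non-zero, and replay them weighted by |advantage|."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.buffer = []  # list of (sample, advantage) pairs

    def add_group(self, samples, rewards):
        for s, a in zip(samples, group_advantages(rewards)):
            if a != 0.0:  # discard uninformative zero-advantage samples
                self.buffer.append((s, a))
        self.buffer = self.buffer[-self.capacity:]  # bounded buffer

    def replay(self, k, rng=random):
        # Larger |advantage| -> sampled more often during replay.
        weights = [abs(a) for _, a in self.buffer]
        return rng.choices(self.buffer, weights=weights,
                           k=min(k, len(self.buffer)))
```

A group where every rollout gets reward 1.0 adds nothing to the buffer, while a mixed group contributes all of its responses, which is exactly the selectivity the article attributes to SSR.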
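Forced Rethinking, as summarized above, appends a rethinking trigger after the model's first complete answer and then decodes a second reasoning pass. A minimal sketch, assuming a `generate` callable that stands in for a real VLM decoding step; the trigger phrases here are illustrative, not the paper's actual prompts:

```python
# Illustrative rethinking triggers; the real framework reportedly uses
# several trigger types to promote diverse reflective behaviors.
RETHINK_TRIGGERS = [
    "Wait, let me verify the answer against the image.",
    "Hmm, let me double-check the reasoning step by step.",
]

def forced_rethinking(generate, prompt, trigger_idx=0):
    """After the first full response, force a second reasoning pass
    by appending a rethinking trigger and continuing decoding."""
    first = generate(prompt)
    trigger = RETHINK_TRIGGERS[trigger_idx]
    second = generate(prompt + first + "\n" + trigger + "\n")
    return first + "\n" + trigger + "\n" + second
```

During RL training, the rewards earned by these forced two-pass rollouts are what teach the model to initiate reflection on its own at inference time.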