Core Viewpoint
- The article introduces MINT-CoT, a new visual reasoning framework designed to strengthen multimodal mathematical reasoning by addressing the limitations of traditional Chain-of-Thought (CoT) methods in handling visual information [1][3].

Group 1: Challenges in Mathematical Visual Reasoning
- Traditional CoT methods struggle to integrate visual information in mathematical contexts because of three main bottlenecks:
  1. Coarse-grained image region selection: most methods rely on bounding boxes, which often sweep in irrelevant content [4].
  2. Visual encoders that are not trained on mathematical images, leading to weak perception of mathematical content [5].
  3. Over-reliance on external tools, which raises training and inference costs and reduces generalizability [6].

Group 2: MINT-CoT Framework
- MINT-CoT (Multimodal Interleaved Chain-of-Thought) is a fine-grained, lightweight, visually interleaved CoT reasoning method designed specifically for mathematical reasoning. Its key innovation is an Interleave Token that dynamically selects relevant visual tokens during the reasoning process, enabling genuine "text-visual joint reasoning" [9]; a minimal sketch of this selection step follows this summary.
- The MINT-CoT dataset contains 54,000 visually interleaved reasoning samples, each aligning reasoning steps with their corresponding image tokens [11]; see the sample-layout sketch below.

Group 3: Training Strategy
- A three-stage training strategy builds up the framework's visually interleaved reasoning capability (a training skeleton is sketched below):
  1. Text CoT fine-tuning, to establish a general step-by-step reasoning format.
  2. Interleaved-modality CoT fine-tuning, to teach the model where to insert visual content during reasoning.
  3. Interleaved-modality CoT reinforcement learning, to optimize visual content selection and the overall reasoning strategy [13].

Group 4: Experimental Results
- Built on the multimodal large model Qwen-VL-7B, the MINT-CoT-7B model delivers superior performance on mathematical visual reasoning tasks, improving over baseline models by +32.59% on MathVista, +26.92% on GeoQA, and +23.2% on MMStar, setting a new state of the art in the field [16].
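The article describes the Interleave Token only at a high level, so the following is a minimal sketch of how such a selection step could work: when the decoder emits the special token, the current hidden state is compared against the projected image patch embeddings, and the most similar patches are spliced into the reasoning context. The special-token id, the threshold, and the function names here are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of fine-grained visual-token selection at an
# Interleave Token, assuming an HF-style causal LM whose outputs expose
# logits and hidden states. Not the paper's actual implementation.
import torch
import torch.nn.functional as F

INTERLEAVE_ID = 151650  # hypothetical vocab id for the Interleave Token

def select_visual_tokens(hidden_state: torch.Tensor,
                         visual_tokens: torch.Tensor,
                         threshold: float = 0.5) -> torch.Tensor:
    """Pick the image patches most relevant to the current reasoning step.

    hidden_state:  (hidden_dim,)             decoder state at the Interleave Token
    visual_tokens: (num_patches, hidden_dim) projected image patch embeddings
    Returns indices of patches whose cosine similarity clears the threshold.
    """
    sims = F.cosine_similarity(visual_tokens, hidden_state.unsqueeze(0), dim=-1)
    selected = (sims >= threshold).nonzero(as_tuple=True)[0]
    if selected.numel() == 0:          # fall back to the single best patch
        selected = sims.argmax().unsqueeze(0)
    return selected

def decode_step(model, input_ids, visual_tokens):
    """One generation step; return patch embeddings to splice into the
    context whenever the model emits the Interleave Token."""
    out = model(input_ids=input_ids, output_hidden_states=True)
    next_id = out.logits[0, -1].argmax()
    if next_id.item() == INTERLEAVE_ID:
        idx = select_visual_tokens(out.hidden_states[-1][0, -1], visual_tokens)
        # The caller would append these embeddings so the next reasoning
        # step attends to exactly these patches (fine-grained, box-free).
        return next_id, visual_tokens[idx]
    return next_id, None
```

Because selection happens at the token level rather than over bounding boxes, irregular regions (a transversal line, an angle mark) can be picked up without dragging in surrounding clutter, which is the coarse-grained selection problem named in Group 1.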
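The article states that the dataset aligns reasoning steps with image tokens but does not give its schema, so the record layout below is illustrative. Field names, the `<interleave>` marker, and the example content are assumptions about what such a sample might look like.

```python
# Illustrative sketch of one visually interleaved training sample: each
# CoT step is paired with the patch-token indices it should attend to.
# The released MINT-CoT dataset's actual schema may differ.
from dataclasses import dataclass, field

@dataclass
class InterleavedSample:
    question: str                     # math problem statement
    image_path: str                   # the associated diagram
    steps: list[str] = field(default_factory=list)           # CoT text steps
    step_visual_ids: list[list[int]] = field(default_factory=list)
    # step_visual_ids[i] holds the patch indices aligned with steps[i]

sample = InterleavedSample(
    question="In the figure, AB is parallel to CD. Find angle x.",
    image_path="geoqa/0001.png",
    steps=[
        "Identify the transversal crossing the two parallel lines. <interleave>",
        "Alternate interior angles are equal, so x = 65 degrees.",
    ],
    step_visual_ids=[[12, 13, 20, 21], []],  # patches covering the transversal
)
```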
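The three training stages listed in Group 3 map naturally onto a staged schedule. The skeleton below is a generic sketch, assuming HF-style models whose forward pass returns a loss; the RL stage is passed in as a callable placeholder, since the article names the stage but not its exact objective or algorithm.

```python
# Generic sketch of the three-stage recipe described above; the loop
# bodies are placeholders for whatever SFT/RL tooling is actually used.
import torch

def sft_stage(model, loader, lr=1e-5, epochs=1):
    """Supervised fine-tuning: next-token loss over (prompt, CoT) batches."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss   # HF-style causal-LM loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model

def train_mint_cot(model, text_cot_loader, interleaved_loader, rl_stage):
    # Stage 1: text-only CoT SFT -> general step-by-step reasoning format.
    model = sft_stage(model, text_cot_loader)
    # Stage 2: interleaved CoT SFT -> learn when to emit the Interleave
    # Token and which visual tokens each step should pull in.
    model = sft_stage(model, interleaved_loader)
    # Stage 3: RL on the interleaved format -> refine visual-token
    # selection and the reasoning strategy (rl_stage supplied by caller,
    # e.g. a policy-gradient loop rewarding answer correctness).
    model = rl_stage(model, interleaved_loader)
    return model
```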
Fine-grained visual reasoning chains come to mathematics: accuracy surges 32% as CUHK MMLab breaks the multimodal math reasoning bottleneck
量子位 (QbitAI) · 2025-06-16 10:30