Multimodal Mathematical Reasoning
We-Math 2.0: A Brand-New Multimodal Math Reasoning Dataset × the First Comprehensive Mathematical Knowledge System
机器之心 · 2025-08-27 10:40
Core Viewpoint
- The article presents We-Math 2.0, a versatile math reasoning system that strengthens visual mathematical reasoning through a structured knowledge system, principle-aligned data expansion, and tailored reinforcement-learning training strategies [5][9][45].

Group 1: Knowledge System
- We-Math 2.0 establishes a comprehensive knowledge system of 5 levels, 491 knowledge points, and 1,819 principles, covering mathematics from elementary school through university [9][14].
- The hierarchy is designed so that knowledge points have clear parent-child relationships and logical connections, with each knowledge point linked to several fundamental principles [14].

Group 2: Data Expansion Strategies
- MathBook-Standard employs a bidirectional expansion strategy: it generates multiple visual variants for each problem ("one problem, many images") and multiple questions for the same image ("one image, many problems") to improve model generalization [15][17].
- Each problem is associated with its multi-level knowledge points so that the dataset covers all 1,819 mathematical principles [17].

Group 3: Difficulty Modeling
- MathBook-Pro introduces three-dimensional difficulty modeling for multimodal math problems, expanding each seed problem into seven difficulty levels along reasoning steps, visual complexity, and contextual complexity [20][21]; a schematic of this expansion appears after this summary.
- The modeling supports dynamic scheduling during reinforcement-learning training, providing a structured path from basic to advanced reasoning [27].

Group 4: Training Strategies
- Training begins with a cold-start supervised fine-tuning (SFT) phase on 1,000 carefully selected samples, followed by a two-phase reinforcement-learning stage [23][30].
- The reinforcement learning rewards the model on its average performance across problems that share the same knowledge principle, strengthening principle-level reasoning rather than per-item memorization [25][30]; see the reward-averaging sketch below.

Group 5: Evaluation and Results
- MathBookEval, the accompanying evaluation benchmark, contains 1,000 samples that probe both knowledge coverage and reasoning depth, built on high-quality, manually rendered images [11][12].
- Experimental results show that MathBook-7B, trained with We-Math 2.0, delivers significant gains over baseline models, particularly in knowledge generalization and multi-step problem solving [32][35].
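To make the three-dimensional difficulty modeling of MathBook-Pro concrete, here is a minimal Python sketch that expands one seed problem into seven levels along reasoning steps, visual complexity, and contextual complexity. The `Problem` fields, the per-level increments, and `expand_to_levels` are illustrative assumptions; the paper defines the actual composition of the levels.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class Problem:
    """A hypothetical seed problem annotated along the three
    difficulty dimensions described for MathBook-Pro."""
    text: str
    reasoning_steps: int      # depth of step-wise reasoning required
    visual_complexity: int    # e.g. number of geometric elements in the figure
    context_complexity: int   # e.g. how much real-world wrapping the statement has

def expand_to_levels(seed: Problem) -> List[Problem]:
    """Sketch of a seven-level curriculum: the seed plus six variants,
    each incrementing one or more difficulty dimensions.
    The exact level composition is an assumption, not the paper's spec."""
    deltas = [
        (0, 0, 0),  # level 1: seed problem
        (1, 0, 0),  # level 2: one extra reasoning step
        (0, 1, 0),  # level 3: one extra visual element
        (0, 0, 1),  # level 4: richer contextual wrapping
        (1, 1, 0),  # level 5: steps + visuals
        (1, 0, 1),  # level 6: steps + context
        (1, 1, 1),  # level 7: all three dimensions raised
    ]
    return [
        replace(seed,
                reasoning_steps=seed.reasoning_steps + ds,
                visual_complexity=seed.visual_complexity + dv,
                context_complexity=seed.context_complexity + dc)
        for ds, dv, dc in deltas
    ]
```

Ordering the seven levels this way is what makes the dynamic scheduling mentioned above possible: the trainer can advance a model from level 1 toward level 7 as its success rate grows.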
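The principle-level average reward from Group 4 can be sketched as follows: rewards from rollouts that share a knowledge principle are pooled, and each rollout is credited with its principle's mean rather than its individual score. Function and principle names here are hypothetical, and the paper's exact reward shaping may differ.

```python
from collections import defaultdict
from typing import Dict, List

def principle_average_rewards(
    rewards: List[float],
    principle_ids: List[str],
) -> Dict[str, float]:
    """Average the scalar reward over all rollouts that share the same
    knowledge principle, so the policy is credited for mastering a
    principle rather than memorizing a single item."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for r, pid in zip(rewards, principle_ids):
        buckets[pid].append(r)
    return {pid: sum(rs) / len(rs) for pid, rs in buckets.items()}

# Usage: map each rollout's reward back to its principle-level mean.
rewards = [1.0, 0.0, 1.0, 1.0]
pids = ["tri.angle_sum", "tri.angle_sum", "circ.inscribed", "circ.inscribed"]
means = principle_average_rewards(rewards, pids)
shaped = [means[pid] for pid in pids]   # [0.5, 0.5, 1.0, 1.0]
```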
Fine-grained visual reasoning chains brought into mathematics, accuracy surges 32%: CUHK MMLab breaks the multimodal math reasoning bottleneck
量子位 · 2025-06-16 10:30
Core Viewpoint
- The article introduces MINT-CoT, a new visual reasoning framework that improves multimodal mathematical reasoning by addressing the limitations of traditional Chain-of-Thought (CoT) methods in handling visual information [1][3].

Group 1: Challenges in Mathematical Visual Reasoning
- Traditional CoT methods struggle to integrate visual information in mathematical contexts because of three bottlenecks:
  1. Coarse-grained image region selection: most methods rely on bounding boxes, which often sweep in irrelevant information [4].
  2. Visual encoders that are not trained on mathematical images, leading to poor perception of mathematical content [5].
  3. Over-reliance on external tools, which raises training and inference costs and reduces generalizability [6].

Group 2: MINT-CoT Framework
- MINT-CoT (Multimodal Interleaved Chain-of-Thought) is a fine-grained, lightweight visually interleaved CoT method designed for mathematical reasoning. Its key innovation is an Interleave Token that dynamically selects relevant visual tokens during the reasoning process, enabling genuine text-visual joint reasoning [9]; a token-selection sketch follows this summary.
- The MINT-CoT dataset contains 54,000 visually interleaved reasoning samples in which each reasoning step is aligned with its corresponding image tokens [11]; an illustrative record schema is also shown below.

Group 3: Training Strategy
- A three-stage training strategy builds up the model's visually interleaved reasoning ability:
  1. Text-only CoT fine-tuning to establish a general reasoning format.
  2. Interleaved-modality CoT fine-tuning to teach the model where to insert visual content during reasoning.
  3. Interleaved-modality CoT reinforcement learning to optimize visual-content selection and reasoning strategies [13].

Group 4: Experimental Results
- MINT-CoT-7B, built on the multimodal large model Qwen-VL-7B, outperforms baselines on mathematical visual reasoning tasks, improving by +32.59% on MathVista, +26.92% on GeoQA, and +23.2% on MMStar, setting a new benchmark in the field [16].
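As a rough illustration of the Interleave Token mechanism from Group 2, the sketch below scores patch-level image tokens against the decoder state at the Interleave Token and keeps only the relevant subset, in contrast to cropping a rectangular bounding box. The dot-product scoring head and the fixed threshold are assumptions for illustration, not MINT-CoT's published architecture.

```python
import torch

def select_visual_tokens(
    hidden_state: torch.Tensor,   # (d,) decoder state at the Interleave Token
    visual_tokens: torch.Tensor,  # (n, d) patch-level image tokens
    threshold: float = 0.5,
) -> torch.Tensor:
    # Score every image token against the current reasoning state.
    # Sigmoid over dot-product similarity is an assumed scoring head.
    scores = torch.sigmoid(visual_tokens @ hidden_state)  # (n,)
    # Keep tokens above threshold: fine-grained, token-level selection
    # rather than a rectangular bounding-box crop, so the chosen region
    # can be any shape.
    return visual_tokens[scores > threshold]              # (k, d)

# Usage: the selected tokens would be spliced back into the CoT sequence
# at the Interleave Token's position.
h = torch.randn(256)
v = torch.randn(40, 256)
picked = select_visual_tokens(h, v)
```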
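The step-to-image-token alignment in the MINT-CoT dataset might look like the following record; all field names and values are illustrative placeholders, not the released format.

```python
# Hypothetical schema for one visual-interleaved CoT sample:
# each reasoning step is paired with the indices of the image
# tokens it depends on, so training can supervise token selection.
sample = {
    "image": "geoqa_00042.png",
    "question": "In triangle ABC, angles A and B are marked; find angle C.",
    "steps": [
        {"text": "Angles A and B are marked in the figure.",
         "visual_token_ids": [17, 18, 31, 32]},   # patches near the marks
        {"text": "The interior angles of a triangle sum to 180 degrees.",
         "visual_token_ids": []},                  # purely textual step
        {"text": "So angle C = 180 - A - B.",
         "visual_token_ids": [17, 31]},
    ],
}
```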