Li Auto MindGPT-4o-Vision Technical Report (Condensed Version)
理想TOP2 · 2025-12-22 12:28
Core Insights
The article discusses the trade-off between general capability and vertical-domain adaptation when a general multimodal large language model (MLLM) is transferred to specific applications, highlighting catastrophic forgetting and the lack of a systematic post-training methodology [2].

Group 1: Key Inefficiencies and Biases in Multimodal Model Training
Three critical inefficiencies are identified:
1. Inefficient resource allocation: traditional data-synthesis methods treat all data equally and ignore differences in information density, so high-value data is under-used and compute is wasted [3].
2. Reward homogenization: conventional reinforcement learning pushes the model to converge on a few "safe" response patterns, sacrificing output diversity and exploration and weakening generalization [3].
3. Unimodal spurious correlations: the model over-relies on the language model's prior knowledge rather than on visual evidence, which produces factual errors in industrial applications [3].

Group 2: MindGPT-4ov Post-Training Paradigm
The MindGPT-4ov post-training paradigm consists of four core modules:
1. Data construction based on an Information Density Score (IDS) and a dual-label system [4].
2. Supervised fine-tuning (SFT) via collaborative curriculum learning [4].
3. Reinforcement learning (RL) with a hybrid reward system [4].
4. Infrastructure improvements for parallel training and inference optimization [4].

Group 3: Information Density Score (IDS) and Dynamic Synthesis Strategy
- The IDS evaluates image data along four dimensions: subject diversity, spatial relationships, OCR text richness, and world-knowledge relevance [4].
- A dynamic synthesis strategy adjusts the number of generated question-answer pairs to the IDS, so that synthesis effort is concentrated on information-dense images (see the IDS sketch after Group 6) [4].

Group 4: Collaborative Curriculum SFT Mechanism
The SFT stage uses a three-stage collaborative curriculum (see the curriculum sketch after Group 6):
1. Cross-domain knowledge learning injects vertical-domain knowledge [6].
2. Capability restoration uses general-purpose datasets to recover any decline in general capabilities [6].
3. Preference alignment optimizes response formats and reduces hallucinations with high-quality preference data [6].

Group 5: Hybrid Reward Mechanism in Reinforcement Learning
The RL phase combines multiple reward signals to balance accuracy, diversity, and conciseness (see the reward sketch after Group 6):
1. Pass@k rewards encourage exploration by rewarding the rollout group whenever any of the k sampled responses is correct [7].
2. Diversity rewards penalize semantically similar responses, promoting varied outputs [7].
3. Length rewards penalize overly long responses, keeping outputs concise [7].
4. Adversarial hallucination data penalizes the model for generating details that lack visual evidence [7].

Group 6: Label Construction and Data Synthesis
- An expert-defined primary label system is expanded into a multi-level label tree that covers both vertical-domain knowledge and general visual capabilities [5].
- Data synthesis matches images with coarse- and fine-grained topics, generates QA pairs according to the IDS, and filters low-quality data with a multi-model voting mechanism (see the voting sketch after Group 6) [8].
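IDS sketch. The report names the four IDS dimensions but this summary does not say how they are combined or how the QA budget follows from the score. Below is a minimal sketch of one plausible implementation, assuming equal dimension weights, per-dimension scores normalized to [0, 1], and a linear mapping from IDS to the number of synthesized QA pairs; the names, weights, and bounds are illustrative rather than taken from the report.

```python
from dataclasses import dataclass


@dataclass
class ImageScores:
    """Per-dimension scores in [0, 1]; how each dimension is actually
    measured is not specified in the summary, so these are placeholders."""
    subject_diversity: float
    spatial_relationships: float
    ocr_richness: float
    world_knowledge: float


def information_density_score(s: ImageScores,
                              weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum over the four IDS dimensions; equal weights are an assumption."""
    dims = (s.subject_diversity, s.spatial_relationships,
            s.ocr_richness, s.world_knowledge)
    return sum(w * d for w, d in zip(weights, dims))


def qa_budget(ids: float, min_pairs: int = 1, max_pairs: int = 10) -> int:
    """Dynamic synthesis strategy: allocate more QA pairs to denser images.
    A linear ramp between min_pairs and max_pairs is assumed."""
    ids = max(0.0, min(1.0, ids))
    return round(min_pairs + ids * (max_pairs - min_pairs))


# Example: a text-heavy cockpit image earns a larger synthesis budget
# than a near-empty scene.
dense = ImageScores(0.8, 0.7, 0.9, 0.6)
sparse = ImageScores(0.2, 0.1, 0.0, 0.1)
print(qa_budget(information_density_score(dense)))   # -> 8
print(qa_budget(information_density_score(sparse)))  # -> 2
```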
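Curriculum sketch. For Group 4, the following shows what a three-stage collaborative curriculum could look like when expressed as a data-mixture schedule; the dataset names and mixing ratios are assumptions for illustration, not figures from the report.

```python
# Hypothetical three-stage collaborative curriculum SFT schedule.
# Dataset names and mixing ratios are illustrative assumptions.
curriculum = [
    {"stage": "cross_domain_knowledge",   # inject vertical-domain knowledge
     "mixture": {"vertical_domain_sft": 0.8, "general_sft": 0.2}},
    {"stage": "capability_restoration",   # recover general capabilities
     "mixture": {"vertical_domain_sft": 0.3, "general_sft": 0.7}},
    {"stage": "preference_alignment",     # fix formats, reduce hallucinations
     "mixture": {"preference_pairs": 1.0}},
]

for stage in curriculum:
    total = sum(stage["mixture"].values())
    assert abs(total - 1.0) < 1e-6, "mixing ratios should sum to 1"
    print(f'{stage["stage"]}: {stage["mixture"]}')
```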
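Reward sketch. For Group 5, here is a minimal sketch of how the reward signals could be combined per rollout group. The summary gives no weights, no semantic-similarity measure, and no length budget, so `w_acc`, `w_div`, `w_len`, the string-similarity proxy, and `max_len` are all assumptions; the adversarial hallucination penalty would enter through the correctness check on adversarial samples and is not modeled separately here.

```python
import difflib


def pass_at_k_reward(correct_flags: list[bool]) -> float:
    """Pass@k-style group reward: the rollout group earns the accuracy
    reward if any of its k sampled responses is correct, so exploratory
    answers are not punished."""
    return 1.0 if any(correct_flags) else 0.0


def diversity_penalty(responses: list[str]) -> float:
    """Average pairwise similarity within the group; near-duplicate responses
    push this toward 1. String similarity is only a stand-in for the semantic
    similarity the report implies."""
    if len(responses) < 2:
        return 0.0
    sims = [difflib.SequenceMatcher(None, responses[i], responses[j]).ratio()
            for i in range(len(responses))
            for j in range(i + 1, len(responses))]
    return sum(sims) / len(sims)


def length_penalty(response: str, max_len: int = 256) -> float:
    """Linear penalty once a response exceeds an assumed character budget."""
    return max(0, len(response) - max_len) / max_len


def hybrid_reward(responses: list[str], correct_flags: list[bool],
                  w_acc: float = 1.0, w_div: float = 0.3,
                  w_len: float = 0.2) -> float:
    """Combine accuracy, diversity, and conciseness into one scalar;
    the weights are illustrative."""
    acc = pass_at_k_reward(correct_flags)
    div = diversity_penalty(responses)
    length = sum(length_penalty(r) for r in responses) / len(responses)
    return w_acc * acc - w_div * div - w_len * length
```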
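Voting sketch. For Group 6, a minimal sketch of the multi-model voting filter that drops low-quality synthesized QA pairs; the judge interface and the majority threshold are assumptions, and the toy judges stand in for real model calls.

```python
from typing import Callable, Iterable, Optional

# A judge model wrapped as a callable: (question, answer) -> accept?
Judge = Callable[[str, str], bool]


def majority_vote_filter(qa_pairs: Iterable[tuple[str, str]],
                         judges: list[Judge],
                         min_votes: Optional[int] = None) -> list[tuple[str, str]]:
    """Keep a synthesized QA pair only if at least min_votes judge models
    accept it; a simple majority is assumed when min_votes is not given."""
    if min_votes is None:
        min_votes = len(judges) // 2 + 1
    kept = []
    for question, answer in qa_pairs:
        votes = sum(judge(question, answer) for judge in judges)
        if votes >= min_votes:
            kept.append((question, answer))
    return kept


# Toy judges: non-empty answer, minimally detailed answer, no refusal.
judges = [lambda q, a: len(a.strip()) > 0,
          lambda q, a: len(a.split()) >= 3,
          lambda q, a: "sorry" not in a.lower()]
pairs = [("What warning light is on?", "A low-fuel warning light."),
         ("What warning light is on?", "Sorry")]
print(majority_vote_filter(pairs, judges))  # keeps only the first pair
```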
Group 7: Performance Validation
MindGPT-4ov produces markedly more concise responses: its average response length is significantly shorter than that of the comparison models, while it achieves higher accuracy (83.3% vs. 80.1%) [9].