Multimodal Large Model Post-Training
Precisely targeting the "tough nuts": hard-sample filtering breaks the SFT dependence, and GRPO-only takes the top spot in both perception and reasoning
量子位· 2025-11-28 04:11
**Core Insights**
- The article presents a new study that challenges the traditional belief that supervised fine-tuning (SFT) is a necessary precursor to reinforcement learning (RL) when training multimodal models, demonstrating that RL alone can effectively optimize multimodal capabilities [2][36]

**Group 1: Research Findings**
- The study, conducted by Central South University and ZTE Corporation, introduces a quantifiable, operational "difficulty sampling" standard for multimodal models and validates a training approach that relies solely on the RL strategy GRPO [3][36]
- The research addresses two long-standing issues in multimodal post-training: the lack of quantifiable sample-difficulty metrics, and training paradigms that cannot optimize perception and reasoning capabilities simultaneously [4][5][6]

**Group 2: Methodology**
- Two complementary difficulty-quantification strategies are proposed, Progressive Image Semantic Masking (PISM) and Cross-Modality Attention Balance (CMAB), which together support the hierarchical training framework [7][36]
- PISM progressively masks different parts of an image to simulate varying degrees of visual-information loss, grading a sample by how strongly the model's performance depends on visual detail (a sketch of this idea follows the summary below) [10][14]
- CMAB gauges the complexity of cross-modal interaction by analyzing the attention scores of generated tokens across Transformer layers, measuring how attention is balanced between text and image inputs (see the second sketch below) [19][34]

**Group 3: Experimental Results**
- The GRPO-only paradigm trained on medium and hard samples significantly outperforms both full-dataset training and random-sample training, underscoring the importance of data quality over quantity [29][36]
- In visual-reasoning tasks, the GRPO-only approach achieved the best scores on multiple metrics, with notable gains on MathVista (68.3) and OCRBench (77.8) over traditional methods [27][29]
- SFT contributed no performance gains, suggesting that it may introduce "pseudo chains of thought" that cap the model's true reasoning ability [29][36]

**Group 4: Future Directions**
- The team outlines three future research directions: dynamic difficulty adjustment for adaptive learning, sampling strategies that combine PISM and CMAB, and validation of the methods on larger multimodal models [38][39]
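The summary describes PISM only at a high level, so the following Python sketch is illustrative rather than the paper's actual procedure: the masking ratios, the patch-wise zeroing, the `model.answer` interface, and the tier thresholds are all assumptions. The core idea it demonstrates is grading a sample by the lowest degree of visual-information loss at which the model's answer breaks.

```python
import numpy as np

# Illustrative masking schedule; the paper's actual ratios are not given
# in this summary.
MASK_RATIOS = [0.0, 0.25, 0.5, 0.75]

def mask_patches(image, ratio, patch=32, seed=0):
    """Zero out a random `ratio` fraction of patch x patch blocks."""
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = image.shape[:2]
    cells = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    k = int(len(cells) * ratio)
    for idx in rng.choice(len(cells), size=k, replace=False):
        r, c = cells[idx]
        out[r:r + patch, c:c + patch] = 0
    return out

def pism_tier(model, image, question, answer):
    """Grade a sample by the lowest masking ratio at which the model fails:
    failing early suggests the answer depends heavily on fine visual detail.
    `model.answer(image, question) -> str` is a hypothetical interface."""
    for ratio in MASK_RATIOS:
        if model.answer(mask_patches(image, ratio), question) != answer:
            return "hard" if ratio <= 0.25 else "medium"
    return "easy"  # still correct under heavy masking
```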
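Likewise, the summary states only that CMAB analyzes how the attention of generated tokens is distributed over text versus image inputs at each Transformer layer; the concrete statistic below (an evenness score over per-layer attention mass) is a hypothetical stand-in, not the paper's formula.

```python
import numpy as np

def cmab_score(attn_layers, image_token_mask):
    """attn_layers: list of (num_generated, num_input) attention matrices,
    one per Transformer layer, each row normalized to sum to 1.
    image_token_mask: boolean array over input tokens (True = image token).
    Returns a balance score in [0, 1]: 1 when attention mass splits evenly
    between image and text tokens, 0 when one modality fully dominates."""
    per_layer = []
    for attn in attn_layers:
        img_mass = attn[:, image_token_mask].sum(axis=1)  # per generated token
        txt_mass = 1.0 - img_mass                         # rows sum to 1
        per_layer.append(1.0 - np.abs(img_mass - txt_mass).mean())
    return float(np.mean(per_layer))
```

A difficulty tiering could then bucket samples by this score, for instance treating strongly interleaved cross-modal attention as a sign of more complex interaction; how the paper maps the statistic to difficulty levels is not specified in the summary.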
AI plays jigsaw puzzles and visual understanding soars: moving beyond text-centric training with an annotation-free post-training paradigm for multimodal large models
量子位· 2025-10-15 10:20
**Core Insights**
- The article emphasizes the importance of a vision-centric approach to post-training for multimodal large models, highlighting the potential of visual self-supervised learning to strengthen the understanding of visual information [1]
- A novel post-training task called Visual Jigsaw is introduced, which reconstructs visual data without relying on additional annotations or visual-generation modules [1]

**Visual Jigsaw Method Overview**
- Visual Jigsaw is a general reconstruction task: visual data (images, videos, 3D) is divided into segments and shuffled, and the model's goal is to predict the correct original order [5]
- Training uses the reinforcement learning algorithm GRPO to optimize the model [5]

**Reward Mechanism**
- A tiered reward system verifies the model's predictions: a fully correct prediction receives a reward of 1, partial correctness is rewarded proportionally with a discount factor, and invalid outputs receive no reward (a minimal sketch follows this summary) [6]

**Task Design for Different Visual Modalities**
- **Image Jigsaw**: Images are divided into equal-sized sub-images on a 2D grid, and the model must restore the correct spatial order (see the construction sketch below) [7]
- **Video Jigsaw**: Videos are segmented into equal-length clips, and the model must reconstruct the original temporal order [8]
- **3D Jigsaw**: Depth points are sampled from RGB-D images, and the model must restore the near-to-far order from the marked positions and shuffled indices [9]

**Experimental Results**
- The effectiveness of Visual Jigsaw was validated across image, video, and 3D modalities, with significant improvements in fine-grained perception and understanding, spatial understanding from monocular images, and compositional visual reasoning [10][11]
- For **Image Jigsaw**, models showed stable gains across multiple vision-centric benchmarks, enhancing fine-grained perception and understanding [10][11]
- For **Video Jigsaw**, the method significantly improved the model's grasp of temporal relationships and overall video comprehension [14]
- For **3D Jigsaw**, notable gains were observed on depth-estimation tasks and in overall 3D spatial reasoning [15]

**Conclusion**
- Visual Jigsaw offers a lightweight, verifiable, and annotation-free self-supervised post-training paradigm that revitalizes visual perception in multimodal large models, encouraging further exploration of vision-centric self- and weakly-supervised tasks [16]
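The tiered reward is concrete enough to sketch directly from the summary. In the minimal Python version below, the discount value `gamma = 0.5` and the permutation encoding (indices 0..n-1) are assumptions; the summary specifies only the three tiers.

```python
def jigsaw_reward(pred, target, gamma=0.5):
    """Tiered reward per the summary: an exactly correct ordering -> 1.0;
    a valid but partially correct permutation -> the fraction of correctly
    placed elements, discounted by gamma; an invalid output -> 0.0.
    gamma = 0.5 is an assumed value (the summary says only "a discount
    factor")."""
    n = len(target)
    if sorted(pred) != list(range(n)):  # wrong length or not a permutation
        return 0.0
    if list(pred) == list(target):
        return 1.0
    frac = sum(p == t for p, t in zip(pred, target)) / n
    return gamma * frac
```

For example, `jigsaw_reward([0, 2, 1, 3], [0, 1, 2, 3])` places 2 of 4 pieces correctly and returns `0.5 * 0.5 = 0.25`, while any output that is not a permutation of the piece indices scores 0.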
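For the Image Jigsaw task itself, a minimal construction sketch under stated assumptions: it splits an image into an equal-sized grid, shuffles the pieces, and records the ground-truth order the model must predict. The grid size, seeding, and prediction convention (the shuffle order versus its inverse) are illustrative choices, not details from the paper.

```python
import numpy as np

def make_image_jigsaw(image, grid=2, seed=0):
    """Split an image into grid x grid equal pieces and shuffle them.
    Returns (shuffled_pieces, order), where order[i] is the original
    index of shuffled piece i; the model must recover this ordering
    (or its inverse, depending on convention)."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    pieces = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
              for r in range(grid) for c in range(grid)]
    order = np.random.default_rng(seed).permutation(len(pieces))
    shuffled = [pieces[k] for k in order]
    return shuffled, order.tolist()
```

The video variant would shuffle equal-length clips along the time axis instead of grid cells, and the 3D variant would shuffle sampled depth points to be re-sorted from near to far, with the same reward applied to the predicted ordering.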