Zeroing In on the "Hard Cases": Hard-Sample Selection Breaks the SFT Dependency, GRPO-only Tops Both Perception and Reasoning
量子位·2025-11-28 04:11

Core Insights
- The article presents a new study that challenges the conventional view that supervised fine-tuning (SFT) is a necessary precursor to reinforcement learning (RL) when training multimodal models, showing that RL alone can effectively optimize multimodal capabilities [2][36].

Group 1: Research Findings
- The study, conducted by Central South University and ZTE Corporation, introduces a quantifiable, operational "difficulty sampling" standard for multimodal models and validates a training approach that relies solely on the RL algorithm GRPO [3][36].
- The research addresses two long-standing issues in multimodal post-training: the lack of quantifiable sample-difficulty metrics and the inability of existing training paradigms to optimize perception and reasoning capabilities simultaneously [4][5][6].

Group 2: Methodology
- Two complementary difficulty-quantification strategies are proposed: Progressive Image Semantic Masking (PISM) and Cross-Modality Attention Balance (CMAB), which underpin the difficulty-tiered training framework [7][36].
- PISM progressively masks different parts of an image to simulate varying degrees of visual information loss, grading each sample by how strongly the model's answer depends on fine visual detail (a minimal sketch appears after this summary) [10][14].
- CMAB evaluates the complexity of cross-modal interaction by analyzing the attention scores of generated tokens across Transformer layers, capturing how attention is balanced between text and image inputs (see the second sketch below) [19][34].

Group 3: Experimental Results
- The experiments indicate that the GRPO-only paradigm trained on medium and hard samples significantly outperforms both full-dataset training and random-sample training, underscoring that data quality matters more than quantity (a selection sketch follows the attention example) [29][36].
- In visual reasoning tasks, the GRPO-only approach achieved the best scores on multiple metrics, with notable gains on MathVista (68.3) and OCRBench (77.8) over traditional methods [27][29].
- The study also finds that SFT did not contribute to performance gains, suggesting it may introduce "pseudo chains of thought" that cap the model's genuine reasoning ability [29][36].

Group 4: Future Directions
- The research team outlines three future directions: dynamic difficulty adjustment for adaptive learning, exploration of sampling strategies that combine PISM and CMAB, and validation of the methods on larger multimodal models [38][39].
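The report describes PISM only at a high level and includes no code. The Python sketch below illustrates one way progressive masking could be turned into a per-sample difficulty score; the masking schedule, the 16-pixel patch grid, and the `model.answer_correct` helper are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Hypothetical masking schedule: fraction of image patches hidden at each level.
MASK_RATIOS = [0.0, 0.25, 0.5, 0.75]

def mask_patches(image: np.ndarray, ratio: float, patch: int = 16, seed: int = 0) -> np.ndarray:
    """Zero out a random subset of patches, simulating loss of visual detail."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    rows, cols = h // patch, w // patch
    masked = image.copy()
    n_mask = int(rows * cols * ratio)
    for idx in rng.choice(rows * cols, size=n_mask, replace=False):
        r, c = divmod(int(idx), cols)
        masked[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return masked

def pism_difficulty(model, image, question, answer) -> int:
    """
    Return the index of the first masking level at which the model's answer
    becomes wrong. A low index means the sample needs fine visual detail
    (harder under this proxy); a high index means it tolerates heavy masking.
    `model.answer_correct` is an assumed helper that runs the VLM and grades it.
    """
    for level, ratio in enumerate(MASK_RATIOS):
        if not model.answer_correct(mask_patches(image, ratio), question, answer):
            return level
    return len(MASK_RATIOS)  # answer survives every masking level
```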
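Likewise, the exact CMAB formula is not given in the summary. The following sketch only illustrates the underlying idea of comparing, layer by layer, how much attention the generated tokens place on image tokens versus text tokens; the balance measure and its averaging over layers are assumptions.

```python
import numpy as np

def cmab_score(attn_per_layer: list[np.ndarray], image_token_mask: np.ndarray) -> float:
    """
    Assumed proxy for Cross-Modality Attention Balance.

    attn_per_layer: one [num_generated_tokens, num_input_tokens] matrix per layer
                    (attention weights already averaged over heads).
    image_token_mask: boolean [num_input_tokens], True where the input position
                      is an image token.
    Returns a value in [0, 1]; 1 means attention mass is split evenly between
    the image and text inputs, 0 means one modality is effectively ignored.
    """
    balances = []
    for attn in attn_per_layer:
        img_mass = attn[:, image_token_mask].sum(axis=-1)   # mass on image tokens
        txt_mass = attn[:, ~image_token_mask].sum(axis=-1)  # mass on text tokens
        balance = 1.0 - np.abs(img_mass - txt_mass) / (img_mass + txt_mass + 1e-8)
        balances.append(balance.mean())
    return float(np.mean(balances))
```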
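Finally, a minimal sketch of the difficulty-tiered selection implied by the results: drop the easiest tier and run GRPO only on the medium and hard samples. The one-third cut-off and the assumption that `score_fn` returns higher values for harder samples (for example, a combination of the two proxies above) are illustrative choices, not taken from the paper.

```python
def select_training_pool(samples, score_fn, easy_quantile=1 / 3):
    """
    Keep only the medium and hard tiers for GRPO-only training by dropping
    the easiest fraction of samples as ranked by score_fn (higher = harder).
    """
    scored = sorted(samples, key=score_fn)
    cut = int(len(scored) * easy_quantile)
    return scored[cut:]  # medium + hard samples retained for RL
```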