Meituan Proposes a New Paradigm for Multimodal Reasoning: A Non-Traditional RL + SFT Sequence Breaks Through Conventional Training Bottlenecks
量子位 (QbitAI) · 2025-07-21 04:23

Core Viewpoint
- The article discusses the Metis-RISE framework developed by researchers from Meituan, which combines Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) in a novel order to strengthen the reasoning capabilities of Multimodal Large Language Models (MLLMs) [1][2].

Summary by Sections

Introduction of the Metis-RISE Framework
- Metis-RISE integrates RL and SFT in a non-traditional sequence, applying RL before SFT, to effectively improve MLLMs' reasoning abilities [2][3].

Training Methodology
- The training process consists of two phases (a hedged code sketch follows at the end of this summary):
- Phase 1 applies RL incentives, letting the model explore freely and activate its latent reasoning ability [6].
- Phase 2 applies SFT to address the specific weaknesses identified during the RL phase [7][8].

Performance Results
- The resulting models, Metis-RISE-7B and Metis-RISE-72B, posted strong scores on the OpenCompass multimodal reasoning leaderboard, with the 72B model ranking fourth overall [3][14].
- Metis-RISE-72B achieved an average score of 56.6, outperforming several proprietary models and demonstrating its competitiveness [13][14].

Comparative Analysis
- Compared against both proprietary and open-source baselines, the Metis-RISE models showed superior results, particularly in the >10B-parameter category [11][12][13].

Ablation Studies
- Detailed ablation studies indicated that the RL phase alone significantly improved performance, lifting the average score from 39.2 to 44.0 [15][16].

Qualitative Analysis
- During the RL phase, accuracy rewards and response lengths rose consistently, indicating clearer reasoning as training progressed [17].

Future Directions
- The team plans to explore iterative applications of RL and SFT to further enhance reasoning capabilities, and to develop model-based validators for more complex reasoning scenarios [18].
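To make the two-phase recipe concrete, below is a minimal, self-contained Python sketch of the control flow the article describes: an RL phase that explores with a verifiable accuracy reward while recording where the model still fails, followed by an SFT phase targeted at exactly those weaknesses. This is a sketch under stated assumptions, not the authors' implementation; every name here (Sample, rollout, rl_phase, build_sft_set, sft_phase) and the 0.5 weakness threshold are hypothetical illustrations.

```python
"""Hypothetical sketch of the two-phase Metis-RISE-style recipe described
in the article: Phase 1 = RL incentives, Phase 2 = SFT on observed weaknesses.
All names and thresholds are illustrative, not the authors' actual code."""

import random
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    answer: str          # ground-truth answer enabling a verifiable reward
    rl_successes: int = 0
    rl_attempts: int = 0


def rollout(model, prompt: str) -> str:
    """Stand-in for sampling a chain-of-thought response from the MLLM."""
    return random.choice(["correct", "wrong"])


def reward(response: str, answer: str) -> float:
    """Rule-based accuracy reward: 1 if the final answer matches, else 0."""
    return 1.0 if response == "correct" else 0.0


def rl_phase(model, data: list[Sample], steps: int = 3) -> None:
    """Phase 1: RL incentives. The model explores freely; per-sample success
    rates are tracked so Phase 2 knows where the model remains weak."""
    for _ in range(steps):
        for s in data:
            resp = rollout(model, s.prompt)
            s.rl_attempts += 1
            s.rl_successes += int(reward(resp, s.answer))
            # (a policy-gradient update on `model` would happen here)


def build_sft_set(data: list[Sample], thresh: float = 0.5) -> list[Sample]:
    """Phase 2 data curation: keep only prompts the RL-trained model still
    fails more often than `thresh`, i.e. its identified weaknesses."""
    return [s for s in data if s.rl_successes / max(s.rl_attempts, 1) < thresh]


def sft_phase(model, weak_set: list[Sample]) -> None:
    """Phase 2: supervised fine-tuning on curated traces for the weak set."""
    for s in weak_set:
        pass  # (a cross-entropy update on an expert reasoning trace)


model = object()  # placeholder for the MLLM
data = [Sample(f"q{i}", "correct") for i in range(8)]
rl_phase(model, data)
weak = build_sft_set(data)
sft_phase(model, weak)
print(f"SFT targets {len(weak)}/{len(data)} prompts the RL phase left weak.")
```

The point of the sketch is the departure from the conventional SFT-then-RL order: here the RL rollouts double as a diagnostic, so the SFT data is curated from the model's observed failures rather than fixed in advance.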