Teaching multimodal large models to "reflect" and "retrace their steps": Shanghai Jiao Tong University & Shanghai AI Lab release MM-HELIX & AHPO to crack complex multimodal reasoning
量子位·2025-10-19 04:10

Core Insights
- The article discusses the limitations of current multimodal large language models (MLLMs) in problem-solving: they tend to give direct answers without iterative reasoning, which hinders their evolution from knowledge containers into problem-solving experts [1][2]

Group 1: MM-HELIX Overview
- The research team from Shanghai Jiao Tong University and Shanghai AI Lab has introduced MM-HELIX, a project aimed at endowing AI with long-chain reflective reasoning, a capability that closely mirrors human intelligence [2]
- MM-HELIX comprises a complete ecosystem designed to enhance the reflective reasoning abilities of AI models [2]

Group 2: MM-HELIX Benchmark
- The MM-HELIX Benchmark serves as a rigorous testing ground for evaluating AI's reflective reasoning, featuring 42 high-difficulty tasks spanning algorithms, graph theory, puzzles, and strategy games [4][5]
- The benchmark includes a sandbox environment with 1,260 questions categorized into five difficulty levels, allowing for fine-grained assessment of current multimodal large models (a scoring sketch follows this summary) [5]

Group 3: Evaluation Results
- Current leading models, both proprietary and open-source, performed poorly on the MM-HELIX Benchmark: only GPT-5 scored above 50 points, while models lacking reflective capabilities scored around 10 points [7]
- Model accuracy dropped significantly on multimodal inputs compared with pure text inputs, highlighting the urgent need to teach MLLMs reflective reasoning [7]

Group 4: MM-HELIX-100K Dataset
- To teach MLLMs to reflect, the team built the MM-HELIX-100K dataset, which contains 100,000 high-quality samples designed to foster reflective reasoning through a step-elicited response generation process [8]
- The dataset aims to provide a rich source of self-correction and insight, essential for training MLLMs in reflective, iterative problem-solving [8]

Group 5: AHPO Algorithm
- The Adaptive Hybrid Policy Optimization (AHPO) algorithm introduces a dynamic teaching approach that lets models learn from expert data while being gradually encouraged to think independently (a hedged sketch follows this summary) [12][13]
- AHPO addresses both the catastrophic forgetting caused by direct fine-tuning and the reward sparsity of on-policy reinforcement learning [11][12]

Group 6: Performance Improvements
- The Qwen2.5-VL-7B model, trained with MM-HELIX-100K and AHPO, showed significant improvements, achieving an 18.6% accuracy gain on the MM-HELIX Benchmark and strong generalization across diverse reasoning tasks [18]
- The model's ability to reflect and adapt proved to be a transferable meta-skill, moving it beyond rote memorization toward genuine understanding [15]
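For context on how a verifiable, difficulty-graded sandbox benchmark can produce the fine-grained scores described in Group 2, here is a minimal sketch. The question schema, the evaluate_by_difficulty helper, and the sandbox_check verifier are hypothetical illustrations, not the MM-HELIX evaluation harness.

```python
from collections import defaultdict

def evaluate_by_difficulty(questions, model_answers, sandbox_check):
    """Score a model per difficulty level on a programmatically verifiable benchmark.

    questions:     iterable of dicts with keys "id", "level" (1-5), and "spec".
    model_answers: dict mapping question id -> the model's final answer.
    sandbox_check: callable(spec, answer) -> bool, a programmatic verifier
                   (e.g., replaying a puzzle solution inside the sandbox).
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for q in questions:
        totals[q["level"]] += 1
        if sandbox_check(q["spec"], model_answers.get(q["id"])):
            correct[q["level"]] += 1
    # Per-level accuracy exposes where long-chain reflective reasoning breaks down.
    return {lvl: correct[lvl] / totals[lvl] for lvl in sorted(totals)}
```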
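The article describes AHPO only at a high level: lean on expert traces while the model still fails, then shift toward on-policy exploration as it improves. The sketch below is a minimal, hypothetical rendering of that idea in PyTorch; the REINFORCE-style objective, the mean-reward baseline, and the success-rate-gated mixing weight (ahpo_style_loss, tau) are illustrative assumptions, not the authors' released algorithm.

```python
import torch

def ahpo_style_loss(expert_nll, sampled_logprobs, rewards, success_rate, tau=0.5):
    """One hybrid policy-optimization step in the spirit of AHPO (sketch only).

    expert_nll:       NLL of expert (off-policy) reflective traces under the current policy.
    sampled_logprobs: log-probabilities of the model's own sampled responses.
    rewards:          scalar rewards for those samples (e.g., sandbox pass/fail).
    success_rate:     fraction of recent rollouts solved on-policy (Python float).
    tau:              assumed threshold above which expert supervision is faded out.
    """
    # Expert term: supervised learning on high-quality reflective traces.
    expert_loss = expert_nll.mean()

    # On-policy term: REINFORCE with a mean-reward baseline to reduce variance.
    advantages = rewards - rewards.mean()
    rl_loss = -(advantages.detach() * sampled_logprobs).mean()

    # Adaptive mixing (assumption): rely on the expert while the model still fails,
    # and hand control to on-policy exploration as its success rate rises.
    alpha = max(0.0, min(1.0, 1.0 - success_rate / tau))
    return alpha * expert_loss + (1.0 - alpha) * rl_loss
```

The design intent this sketch illustrates is the article's stated trade-off: pure fine-tuning on expert data risks catastrophic forgetting, while pure on-policy RL stalls under sparse rewards, so the expert term acts as dense guidance that is annealed away once the model can earn rewards on its own.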