R1-Style Training No Longer Judges Only Whether the Answer Is Right: CUHK Introduces the SophiaVL-R1 Model

Core Insights
- The article covers the evolution of R1-style reasoning models and introduces SophiaVL-R1, which adds a "thinking reward" to training in order to improve reasoning quality and generalization [3][5][13].

Group 1: Model Development
- SophiaVL-R1 goes beyond earlier models by rewarding not only a correct final answer but also the quality of the reasoning that produced it [3][7].
- The model performs strongly on several mathematical and multimodal benchmarks, outperforming larger models such as LLaVA-OneVision-72B, which has roughly ten times as many parameters [5][20].

Group 2: Thinking Reward Mechanism
- The "thinking reward" scores the reasoning process itself, so the model is pushed to learn sound reasoning strategies rather than shortcuts that happen to reach the right answer [7][13].
- A purpose-built dataset of scored reasoning processes, covering diverse thinking patterns and error types, was used to train a "thinking scoring model" [10][11].

Group 3: Trust-GRPO Algorithm
- To counter reward hacking, SophiaVL-R1 is trained with Trust-GRPO, which estimates how credible the thinking rewards are by comparing the thinking rewards given to responses with correct answers against those given to responses with incorrect answers [17][18].
- When the two disagree with expectations, the algorithm lowers the credibility assigned to the thinking rewards, which makes training more stable and reliable [18].

Group 4: Performance Metrics
- Across evaluation benchmarks, SophiaVL-R1-7B shows strong reasoning and generalization, matching or exceeding the scores of significantly larger models [20][21].
- Reported results include 61.3 on MMMU and 2403.8 on MME [21][23].

Group 5: Experimental Validation
- Ablation studies indicate that each component of SophiaVL-R1 contributes to overall performance, and that the full method trains faster and reaches better results [22][23].
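
The Trust-GRPO idea described under Group 3 can be illustrated with a minimal sketch. The article does not give the exact formula, so everything specific here is an assumption made for illustration: the function names, the weight `alpha`, the binary result reward, and the exponential penalty applied when incorrect answers receive higher thinking rewards on average than correct ones. The only part taken from standard GRPO is the group-normalized advantage (reward minus group mean, divided by group standard deviation).

```python
import math
from statistics import mean, pstdev


def trust_weight(think_rewards, is_correct):
    """Hypothetical trust weight in (0, 1] for one group of sampled responses.

    Assumption: thinking rewards are credible when correct answers score at
    least as high on average as incorrect ones; otherwise trust decays
    exponentially with the size of the gap.
    """
    right = [t for t, ok in zip(think_rewards, is_correct) if ok]
    wrong = [t for t, ok in zip(think_rewards, is_correct) if not ok]
    if not right or not wrong:
        return 1.0  # nothing to compare against; assumed default
    gap = mean(right) - mean(wrong)
    return 1.0 if gap >= 0 else math.exp(gap)


def combined_rewards(think_rewards, is_correct, alpha=0.3):
    """Result reward (1 or 0) plus a trust-weighted thinking reward.

    `alpha` is an illustrative weight, not a value from the article.
    """
    gamma = trust_weight(think_rewards, is_correct)
    return [float(ok) + gamma * alpha * t
            for t, ok in zip(think_rewards, is_correct)]


def group_advantages(rewards):
    """GRPO-style advantage: normalize each reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one question: thinking scores and correctness.
    think = [0.9, 0.7, 0.8, 0.2]
    correct = [True, True, False, False]
    rewards = combined_rewards(think, correct)
    print(trust_weight(think, correct), group_advantages(rewards))
```

Under these assumptions, a group in which wrong answers receive higher thinking scores than right ones gets a trust weight below 1, so the thinking reward contributes less to the advantage and a scorer that can be gamed has less influence on the policy update.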