Reinforcement Learning (RL)
An industry breakthrough in multimodal generalized reasoning: OPPO Research Institute and HKUST (Guangzhou) propose OThink-MR1
量子位· 2025-03-30 02:37
Core Viewpoint
- The article discusses OThink-MR1, a new technology developed by researchers from OPPO Research Institute and the Hong Kong University of Science and Technology (Guangzhou), which enhances multimodal language models' generalized reasoning capabilities through dynamic reinforcement learning [1][2][29].

Group 1: Technology Overview
- OThink-MR1 extends reinforcement learning to multimodal language models, enabling them to better handle complex tasks and new scenarios [1][2].
- The technology addresses the limitations of existing multimodal models that rely primarily on supervised fine-tuning (SFT), which hinders the development of general reasoning abilities [4][5].
- OThink-MR1 employs two core components: a dynamic KL divergence strategy (GRPO-D) and a carefully designed reward model, significantly improving learning efficiency and reasoning capability [8].

Group 2: Dynamic KL Divergence Strategy
- The dynamic KL divergence strategy balances exploration of new strategies against exploitation of existing experience, adapting as training progresses [10][11].
- This approach prevents the model from getting stuck in local optima by encouraging exploration in the early stages and gradually shifting toward leveraging accumulated knowledge [12].

Group 3: Reward Model
- The reward model in OThink-MR1 provides two types of rewards, a verification-accuracy reward and a format reward, to guide the model's learning process [13][14].
- These rewards help the model understand its strengths and areas for improvement, promoting targeted learning [15].

Group 4: Experimental Validation
- The first experiment demonstrated that incorporating format rewards significantly improved model performance on geometric reasoning tasks, highlighting the importance of both content and format in evaluation [17].
- The second experiment tested cross-task evaluation, showing that the GRPO-D-trained model excelled across diverse tasks, unlike models trained with SFT [21][23].
- The third experiment revealed that OThink-MR1's GRPO-D outperformed traditional SFT in same-task evaluation, indicating its effectiveness in enhancing model capability [28].

Group 5: Future Implications
- OThink-MR1 represents a significant advance in the development of multimodal language models, showcasing the potential of dynamic reinforcement learning to enhance reasoning and generalization abilities [29].
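The dynamic KL divergence strategy in Group 2 can be pictured as a KL penalty whose weight grows with the training step: a small penalty early lets the policy drift from the reference model (exploration), while a larger penalty late anchors updates to accumulated knowledge (exploitation). The sketch below is a minimal illustration of that idea, not the paper's actual schedule; the function names, the linear ramp, and the beta range are all assumptions.

```python
import math


def dynamic_kl_weight(step: int, total_steps: int,
                      beta_min: float = 0.0, beta_max: float = 0.1) -> float:
    """Linearly anneal the KL-penalty weight from beta_min to beta_max.

    Early in training the penalty is weak (more exploration); late in
    training it is strong (stay close to the reference policy).
    """
    frac = min(step / total_steps, 1.0)
    return beta_min + (beta_max - beta_min) * frac


def grpo_d_loss(policy_logprob: float, ref_logprob: float,
                advantage: float, step: int, total_steps: int) -> float:
    """GRPO-style per-token objective with a dynamic KL term (illustrative).

    Uses the unbiased "k3" KL estimator common in GRPO implementations:
    exp(log_ratio) - log_ratio - 1, which is zero when the policies agree.
    """
    beta = dynamic_kl_weight(step, total_steps)
    log_ratio = ref_logprob - policy_logprob
    kl = math.exp(log_ratio) - log_ratio - 1.0
    # Policy-gradient term weighted by the group-relative advantage,
    # plus the annealed KL penalty.
    return -(policy_logprob * advantage) + beta * kl
```

Because the KL weight starts near zero, early updates are driven almost entirely by the advantage term, matching the article's description of exploration-first training.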
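The two-part reward in Group 3 (verification accuracy plus format) is a common pattern in rule-based RL for reasoning models. Below is a hedged sketch of how such a reward could be computed; the `<think>`/`<answer>` tag convention, the weights, and all function names are assumptions for illustration, not OThink-MR1's actual implementation.

```python
import re


def format_reward(response: str) -> float:
    """1.0 if the response wraps reasoning in <think>...</think> followed by
    a final <answer>...</answer> block, else 0.0 (assumed tag convention)."""
    pattern = r"<think>.*</think>\s*<answer>.*</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer exactly matches the reference answer."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold.strip() else 0.0


def total_reward(response: str, gold: str,
                 w_acc: float = 1.0, w_fmt: float = 0.5) -> float:
    """Weighted sum of the two reward signals (weights are assumptions)."""
    return w_acc * accuracy_reward(response, gold) + w_fmt * format_reward(response)
```

A well-formatted, correct answer earns both rewards, while a correct answer with no reasoning trace earns only the accuracy portion, which is how the format reward steers the model toward explicit reasoning.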
Three overnight developments: is China's AI tech narrative strengthening?
华尔街见闻· 2025-03-06 11:11
From last night through today, three major pieces of news landed in the AI world, and the China-tech narrative keeps gaining strength. Alibaba's Tongyi team kept its promise to open-source a new RL-trained model this week, releasing it last night. Most striking, the 32B model matches the full-strength DeepSeek R1: on AIME24 (a math benchmark) and LiveCodeBench (a coding benchmark), Qwen QwQ-32B performs on par with DeepSeek-R1 and far outperforms o1-mini as well as R1-distilled models of the same size. It is already available in the Tongyi app and on the web. The RL training also appears not to have taken long; contacts at Alibaba note that, unlike traditional reward models, this run provided feedback on math problems by verifying the correctness of generated answers.

Alibaba Tongyi open-sources a new RL model. Junyang Lin (Qwen team): "This week we release QwQ-Max-Preview on Qwen Chat. I know you guys may think what happened to the opensource of this team. Here is a straight answer to you all: we will opensource the m ..."
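"Verifying the correctness of generated answers" as a reward signal typically means a rule-based checker rather than a learned reward model. The sketch below shows one plausible shape for such a verifier for math answers; it is a hypothetical helper under assumed names, not a description of Alibaba's actual training code.

```python
from fractions import Fraction


def math_answer_correct(predicted: str, reference: str) -> bool:
    """Rule-based check: compare answers as exact rationals when both
    parse (so "0.5" matches "1/2"), falling back to string equality."""
    try:
        return Fraction(predicted.strip()) == Fraction(reference.strip())
    except ValueError:
        return predicted.strip() == reference.strip()


def correctness_reward(predicted: str, reference: str) -> float:
    """Binary reward derived from the verifier: 1.0 if correct, else 0.0."""
    return 1.0 if math_answer_correct(predicted, reference) else 0.0
```

Because the signal comes from checking the answer itself, no separate reward model needs to be trained, which is one reason such RL runs can be comparatively cheap.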