Reinforcement Learning (RL)
OpenAI Reveals How Deep Research Was Built
锦秋集· 2025-04-30 07:09
Core Insights
- OpenAI's Deep Research focuses on integrating search, browsing, filtering, and information synthesis into the model's core capabilities through reinforcement learning, rather than relying solely on prompt engineering [1][3][4]

Group 1: Origin and Goals of Deep Research
- The team shifted from simpler transactional tasks to tackling knowledge integration, which is deemed essential for achieving AGI [3][6]
- Emphasis is placed on data quality over quantity, with a preference for expert-annotated high-value examples and reinforcement learning to optimize strategies [3][5]
- The ultimate vision is a unified intelligent agent that autonomously chooses the appropriate tools and maintains continuity of memory and context [3][14]

Group 2: Development Process
- The development process involved creating a demonstration version based on prompt engineering before focusing on data creation and model training [7][8]
- The team utilized human trainers for data handling and designed new data types to train the model effectively [8][10]
- Iterative collaboration with reinforcement learning teams allowed for significant improvements without the pressure of rapid product releases [7][8]

Group 3: Reinforcement Learning Fine-Tuning (RFT)
- RFT can enhance model performance for specific tasks, especially when the task is critical to business processes [9]
- If a task is significantly different from the model's training, RFT is advisable; otherwise, waiting for natural model upgrades may be more beneficial [9] (a sketch of the kind of task-specific grader RFT relies on follows this summary)

Group 4: Role of Human Expertise
- High-quality data creation requires domain expertise to assess the validity and relevance of sources [11]
- OpenAI's approach involves engaging experts across various fields to create diverse synthetic datasets [11]

Group 5: Path to AGI and the Role of Reinforcement Learning
- The resurgence of reinforcement learning has bolstered confidence in the path to AGI, though significant work remains to ensure models can effectively utilize tools and evaluate task outcomes [12][13]
- A strong foundational model is essential for the success of reinforcement learning efforts [12]

Group 6: User Trust and Interaction
- Establishing user trust is crucial, necessitating explicit confirmations for significant operations during initial interactions [16]
- As models improve, users may gradually allow more autonomy, but initial safeguards are necessary to prevent errors [16][17]

Group 7: Future of Intelligent Agents
- Future intelligent agents must address complex security issues, especially when accessing sensitive user data [17][19]
- The goal is to create agents capable of executing long-duration tasks while effectively managing context and memory [17][21]

Group 8: Performance and User Expectations
- Users expect instant responses, but Deep Research requires time for in-depth analysis, leading to potential delays [29]
- OpenAI plans to introduce products that balance the need for quick responses with the depth of research [29][30]

Group 9: Applications and User Feedback
- Users have found Deep Research valuable in fields like medical research and coding, validating its effectiveness [25][26]
- The model excels in handling specific queries and generating comprehensive reports, making it suitable for detailed research tasks [27]
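The RFT guidance in Group 3 is easier to picture with a concrete grader in hand. The sketch below illustrates the general pattern only and is not OpenAI's actual RFT API: the `Example` dataclass, the `grade` function, and its scoring rules are hypothetical, and a real setup would encode a domain expert's rubric rather than simple string matching.

```python
# Hypothetical sketch of a task-specific grader of the kind RFT relies on.
# The grader turns expert judgment ("is this answer valid for my business task?")
# into a scalar reward that an RL fine-tuning loop can optimize against.
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    reference_answer: str

def grade(model_output: str, example: Example) -> float:
    """Return a reward in [0, 1]: full credit for an exact match with the expert
    reference, partial credit if the reference appears inside a longer answer."""
    out = model_output.strip().lower()
    ref = example.reference_answer.strip().lower()
    if out == ref:
        return 1.0
    if ref and ref in out:
        return 0.5
    return 0.0

if __name__ == "__main__":
    ex = Example(prompt="Which ICD-10 code covers type 2 diabetes?",
                 reference_answer="E11")
    print(grade("E11", ex))               # 1.0
    print(grade("The code is E11.", ex))  # 0.5
    print(grade("E10", ex))               # 0.0
```

The grader, not a human in the loop, supplies the reward during fine-tuning, which is why RFT tends to pay off only when the task is stable, well specified, and critical to a business process, as the summary notes.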
A Masterclass on Reinforcement Learning | 42章经
42章经· 2025-04-13 12:02
Qu Kai (曲凯): Today we have invited Wu Yi (吴翼), an expert in reinforcement learning (RL) in China. Wu Yi is currently an assistant professor at Tsinghua University's Institute for Interdisciplinary Information Sciences, previously worked at OpenAI, and was among the earliest people in China to study reinforcement learning. Today we will try to talk through the topic of RL thoroughly. First, Wu Yi, could you briefly explain what RL actually is?

Wu Yi: RL is a rather special class of problems under the broad umbrella of machine learning.

The essence of traditional machine learning is memorizing a large number of data pairs labeled with the correct answer.

For example, if you want a machine learning system to tell whether a picture shows a cat or a dog, you first collect 10,000 cat photos and 10,000 dog photos, label every single one, and have the model memorize them.

The previous wave of AI, the era of the "four little AI dragons", was built on this framework, with face recognition, fingerprint recognition, image recognition and other classification problems as the main applications.

These problems have two characteristics: first, they are single-step, so the task ends once the picture has been classified; second, there is a clear standard answer.

RL is very different. RL was first used to play games, and games differ from classification problems in two major ways.

First, a game involves a great many actions and decisions. Take a table tennis game: serving, receiving, and returning are all non-standardized actions, and different choices directly affect the final outcome.

Second, there may be tens of thousands of ways to win a game; there is no single standard answer.

So RL is an algorithmic framework for solving multi-step decision problems. The problems it tackles have no standard answer and the specific decision at each step is unconstrained, but once all the decisions have been made, a feedback mechanism judges whether the final result is good or bad.

In that sense RL is more general, and its logic is very close to how we solve problems in real life. For example, if I need to make a business trip to the US, as long as the round trip works out in the end, how I get to the airport, which airline I choose, and which exact flight I take are all left open.

That is also why I find life interesting in a certain way: you have to spend a lot of time figuring out what your own reward function is, and many people work hard for a long time only to discover in the end that they picked the wrong reward function. ...
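To make the contrast concrete, here is a minimal policy-gradient sketch (my own illustration, not something from the interview): a toy corridor in which the agent makes a sequence of left/right decisions and receives a reward of 1 only at the end of the episode, and only if it reached the goal. The environment, the plain REINFORCE update, and all constants are assumptions chosen for brevity.

```python
# Toy multi-step decision problem: walk right along a corridor of 6 states to
# reach the goal. No step has a labeled "correct" action; the only signal is a
# single reward after the whole episode, which is the setting RL is built for.
import numpy as np

N_STATES, N_ACTIONS, GOAL, MAX_STEPS = 6, 2, 5, 10
theta = np.zeros((N_STATES, N_ACTIONS))  # tabular softmax policy parameters

def policy(state):
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def run_episode(rng):
    state, trajectory = 0, []
    for _ in range(MAX_STEPS):
        probs = policy(state)
        action = rng.choice(N_ACTIONS, p=probs)  # 0 = step left, 1 = step right
        trajectory.append((state, action, probs))
        state = max(0, min(GOAL, state + (1 if action == 1 else -1)))
        if state == GOAL:
            break
    reward = 1.0 if state == GOAL else 0.0       # feedback only after all decisions
    return trajectory, reward

rng = np.random.default_rng(0)
for _ in range(2000):
    trajectory, reward = run_episode(rng)
    for state, action, probs in trajectory:      # REINFORCE: theta += lr * R * grad(log pi)
        grad = -probs.copy()
        grad[action] += 1.0
        theta[state] += 0.1 * reward * grad

print("P(step right) per state:",
      np.round([policy(s)[1] for s in range(N_STATES)], 2))
```

Unlike the cat/dog classifier, nothing tells the agent which individual move was right; the policy improves purely from the end-of-episode feedback, and many different action sequences can earn the same reward.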
Industry Breakthrough in Multimodal Generalized Reasoning: OPPO Research Institute and HKUST (Guangzhou) Propose OThink-MR1
量子位· 2025-03-30 02:37
Core Viewpoint
- The article discusses the introduction of a new technology called OThink-MR1, developed by researchers from OPPO Research Institute and the Hong Kong University of Science and Technology (Guangzhou), which enhances multimodal language models' generalized reasoning capabilities through dynamic reinforcement learning [1][2][29]

Group 1: Technology Overview
- OThink-MR1 extends reinforcement learning to multimodal language models, enabling them to better handle complex tasks and new scenarios [1][2]
- The technology addresses the limitations of existing multimodal models that primarily rely on supervised fine-tuning (SFT), which hinders the development of general reasoning abilities [4][5]
- OThink-MR1 employs two core components: a dynamic KL divergence strategy (GRPO-D) and a carefully designed reward model, significantly improving learning efficiency and reasoning capabilities [8]

Group 2: Dynamic KL Divergence Strategy
- The dynamic KL divergence strategy balances exploration of new strategies and utilization of existing experience, adapting as training progresses [10][11]
- This approach prevents the model from getting stuck in local optima by encouraging exploration in the early stages and gradually shifting towards leveraging accumulated knowledge [12]

Group 3: Reward Model
- The reward model in OThink-MR1 provides two types of rewards: a verification-accuracy reward and a format reward, guiding the model's learning process [13][14]
- These rewards help the model understand its strengths and areas for improvement, promoting targeted learning [15] (a sketch of these two ingredients follows this summary)

Group 4: Experimental Validation
- The first experiment demonstrated that incorporating format rewards significantly improved model performance on geometric reasoning tasks, highlighting the importance of both content and format in evaluation [17]
- The second experiment tested cross-task evaluation, showing that the GRPO-D-trained model excelled on diverse tasks, unlike models trained with SFT [21][23]
- The third experiment revealed that OThink-MR1's GRPO-D outperformed traditional SFT methods in same-task evaluations, indicating its effectiveness in enhancing model capabilities [28]

Group 5: Future Implications
- OThink-MR1 represents a significant advancement in the development of multimodal language models, showcasing the potential of dynamic reinforcement learning to enhance reasoning and generalization abilities [29]
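The two components highlighted in Groups 2 and 3 can be sketched compactly. Everything below is a hedged reading of the summary rather than the paper's code: the answer template, the reward weights, the linear KL ramp, and the function names are assumptions, and OThink-MR1's actual GRPO-D schedule and reward definitions may differ.

```python
# Sketch of (1) a reward combining answer-verification accuracy with a format
# reward and (2) a KL weight that changes over training, plus the GRPO-style
# within-group advantage. Details are assumptions, not the paper's exact design.
import re
import numpy as np

def format_reward(output: str) -> float:
    """1.0 if the response follows the required template, e.g. '<answer>...</answer>'."""
    return 1.0 if re.search(r"<answer>.*</answer>", output, re.S) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the verifiable ground truth."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.S)
    return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(output: str, ground_truth: str, w_fmt: float = 0.5) -> float:
    return accuracy_reward(output, ground_truth) + w_fmt * format_reward(output)

def dynamic_kl_weight(step: int, total_steps: int, lo: float = 0.01, hi: float = 0.1) -> float:
    """Small KL pull toward the reference policy early (more exploration), a
    stronger pull later (exploit accumulated knowledge). A simple linear ramp."""
    return lo + (hi - lo) * step / max(1, total_steps)

def group_advantages(rewards):
    """GRPO-style advantage: standardize rewards within one group of sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: score a group of 4 sampled responses to one geometry question.
samples = ["<answer>42</answer>", "42", "<answer>41</answer>", "<answer>42</answer>"]
rewards = [total_reward(s, "42") for s in samples]
print(rewards)                                  # [1.5, 0.0, 0.5, 1.5]
print(group_advantages(rewards))                # standardized within the group
print(dynamic_kl_weight(0, 1000), dynamic_kl_weight(1000, 1000))  # 0.01 0.1
```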
Three Things Last Night: Is China's AI Tech Narrative Getting Stronger?
华尔街见闻· 2025-03-06 11:11
From last night through today, three major pieces of news hit the AI world, and the narrative around Chinese tech keeps getting stronger.

Alibaba's Tongyi team kept its word: it said it would open-source another new RL-trained model this week, and it dropped last night. The most impressive part is that a 32B model matches the full-size DeepSeek R1: on AIME24, which tests mathematical ability, and on LiveCodeBench, which evaluates coding ability, Qwen's QwQ-32B performs on par with DeepSeek-R1 and far outperforms o1-mini and the R1 distilled models of the same size. It can already be tried in the Tongyi app and on the web.

The RL training also does not appear to have taken very long. Friends at Alibaba noted that, unlike the conventional reward models used before, this run provided feedback on math problems by verifying the correctness of the generated answers.

[Image: Alibaba Tongyi open-sources a new RL model. Screenshot of a post by Junyang Lin (@JustinLin610) of the Qwen team: "This week we release QwQ-Max-Preview on Qwen Chat. I know you guys may think what happened to the opensource of this team. Here is a straight answer to you all: we will opensource the m ..."]
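The feedback-by-verification idea mentioned above can be illustrated with a tiny sketch (my own, not Alibaba's training code; the `\boxed{}` convention and the `Fraction`-based comparison are assumptions): a verifiable reward for math answers can be as simple as a function that parses the model's final answer and checks it against the known result.

```python
# Verification-based reward: score a math answer by programmatically checking
# its correctness instead of asking a learned reward model to judge it.
import re
from fractions import Fraction

def extract_answer(text: str) -> str | None:
    """Pull the final answer out of a '\\boxed{...}' span, a common convention."""
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    return m.group(1).strip() if m else None

def verify_reward(model_output: str, ground_truth: str) -> float:
    """1.0 if the boxed answer is numerically equal to the ground truth, else 0.0."""
    ans = extract_answer(model_output)
    if ans is None:
        return 0.0
    try:
        return 1.0 if Fraction(ans) == Fraction(ground_truth) else 0.0
    except (ValueError, ZeroDivisionError):
        return 1.0 if ans == ground_truth else 0.0

print(verify_reward(r"... so the result is \boxed{3/4}.", "0.75"))  # 1.0
print(verify_reward(r"... so the result is \boxed{0.8}.", "0.75"))  # 0.0
```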