Reinforcement Learning (RL)
OpenAI Reveals How Deep Research Was Built
锦秋集· 2025-04-30 07:09
Core Insights
- OpenAI's Deep Research focuses on integrating search, browsing, filtering, and information synthesis into the model's core capabilities through reinforcement learning, rather than relying solely on prompt engineering [1][3][4]

Group 1: Origin and Goals of Deep Research
- The team shifted from simpler transactional tasks to tackling knowledge integration, which is deemed essential for achieving AGI [3][6]
- Emphasis is placed on data quality over quantity, with a preference for expert-annotated high-value examples and reinforcement learning to optimize strategies [3][5]
- The ultimate vision is a unified intelligent agent that autonomously chooses the appropriate tools and maintains continuity of memory and context [3][14]

Group 2: Development Process
- The development process involved creating a demonstration version based on prompt engineering before focusing on data creation and model training [7][8]
- The team utilized human trainers for data handling and designed new data types to train the model effectively [8][10]
- Iterative collaboration with reinforcement learning teams allowed for significant improvements without the pressure of rapid product releases [7][8]

Group 3: Reinforcement Learning Fine-Tuning (RFT)
- RFT can enhance model performance for specific tasks, especially when the task is critical to business processes [9]
- If a task is significantly different from the model's training, RFT is advisable; otherwise, waiting for natural model upgrades may be more beneficial [9] (a sketch of the kind of task-specific grader RFT relies on follows this summary)

Group 4: Role of Human Expertise
- High-quality data creation requires domain expertise to assess the validity and relevance of sources [11]
- OpenAI's approach involves engaging experts across various fields to create diverse synthetic datasets [11]

Group 5: Path to AGI and the Role of Reinforcement Learning
- The resurgence of reinforcement learning has bolstered confidence in the path to AGI, though significant work remains to ensure models can effectively utilize tools and evaluate task outcomes [12][13]
- A strong foundational model is essential for the success of reinforcement learning efforts [12]

Group 6: User Trust and Interaction
- Establishing user trust is crucial, necessitating explicit confirmations for significant operations during initial interactions [16]
- As models improve, users may gradually allow more autonomy, but initial safeguards are necessary to prevent errors [16][17]

Group 7: Future of Intelligent Agents
- Future intelligent agents must address complex security issues, especially when accessing sensitive user data [17][19]
- The goal is to create agents capable of executing long-duration tasks while effectively managing context and memory [17][21]

Group 8: Performance and User Expectations
- Users expect instant responses, but Deep Research requires time for in-depth analysis, leading to potential delays [29]
- OpenAI plans to introduce products that balance the need for quick responses with the depth of research [29][30]

Group 9: Applications and User Feedback
- Users have found Deep Research valuable in fields like medical research and coding, validating its effectiveness [25][26]
- The model excels in handling specific queries and generating comprehensive reports, making it suitable for detailed research tasks [27]
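The RFT guidance in Group 3 is easier to picture with a concrete grader in hand. The sketch below illustrates the general pattern only and is not OpenAI's actual RFT API: the `Example` dataclass, the `grade` function, and its scoring rules are hypothetical, and a real setup would encode a domain expert's rubric rather than simple string matching.

```python
# Hypothetical sketch of a task-specific grader of the kind RFT relies on.
# The grader turns expert judgment ("is this answer valid for my business task?")
# into a scalar reward that an RL fine-tuning loop can optimize against.
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    reference_answer: str

def grade(model_output: str, example: Example) -> float:
    """Return a reward in [0, 1]: full credit for an exact match with the expert
    reference, partial credit if the reference appears inside a longer answer."""
    out = model_output.strip().lower()
    ref = example.reference_answer.strip().lower()
    if out == ref:
        return 1.0
    if ref and ref in out:
        return 0.5
    return 0.0

if __name__ == "__main__":
    ex = Example(prompt="Which ICD-10 code covers type 2 diabetes?",
                 reference_answer="E11")
    print(grade("E11", ex))               # 1.0
    print(grade("The code is E11.", ex))  # 0.5
    print(grade("E10", ex))               # 0.0
```

The grader, not a human in the loop, supplies the reward during fine-tuning, which is why RFT tends to pay off only when the task is stable, well specified, and critical to a business process, as the summary notes.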
A Masterclass on Reinforcement Learning | 42章经
42章经· 2025-04-13 12:02
Qu Kai (曲凯): Today we have invited Wu Yi (吴翼), an expert in reinforcement learning (RL) in China. Wu Yi is currently an assistant professor at Tsinghua University's Institute for Interdisciplinary Information Sciences, previously worked at OpenAI, and was among the earliest people in China to study reinforcement learning. Today we will try to talk through the topic of RL thoroughly. First, Wu Yi, could you briefly explain what RL actually is?

Wu Yi: RL is a rather special class of problems under the broad umbrella of machine learning.

The essence of traditional machine learning is memorizing a large number of data pairs labeled with the correct answer.

For example, if you want a machine learning system to tell whether a picture shows a cat or a dog, you first collect 10,000 cat photos and 10,000 dog photos, label every single one, and have the model memorize them.

The previous wave of AI, the era of the "four little AI dragons", was built on this framework, with face recognition, fingerprint recognition, image recognition and other classification problems as the main applications.

These problems have two characteristics: first, they are single-step, so the task ends once the picture has been classified; second, there is a clear standard answer.

RL is very different. RL was first used to play games, and games differ from classification problems in two major ways.

First, a game involves a great many actions and decisions. Take a table tennis game: serving, receiving, and returning are all non-standardized actions, and different choices directly affect the final outcome.

Second, there may be tens of thousands of ways to win a game; there is no single standard answer.

So RL is an algorithmic framework for solving multi-step decision problems. The problems it tackles have no standard answer and the specific decision at each step is unconstrained, but once all the decisions have been made, a feedback mechanism judges whether the final result is good or bad.

In that sense RL is more general, and its logic is very close to how we solve problems in real life. For example, if I need to make a business trip to the US, as long as the round trip works out in the end, how I get to the airport, which airline I choose, and which exact flight I take are all left open.

That is also why I find life interesting in a certain way: you have to spend a lot of time figuring out what your own reward function is, and many people work hard for a long time only to discover in the end that they picked the wrong reward function. ...
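To make the contrast concrete, here is a minimal policy-gradient sketch (my own illustration, not something from the interview): a toy corridor in which the agent makes a sequence of left/right decisions and receives a reward of 1 only at the end of the episode, and only if it reached the goal. The environment, the plain REINFORCE update, and all constants are assumptions chosen for brevity.

```python
# Toy multi-step decision problem: walk right along a corridor of 6 states to
# reach the goal. No step has a labeled "correct" action; the only signal is a
# single reward after the whole episode, which is the setting RL is built for.
import numpy as np

N_STATES, N_ACTIONS, GOAL, MAX_STEPS = 6, 2, 5, 10
theta = np.zeros((N_STATES, N_ACTIONS))  # tabular softmax policy parameters

def policy(state):
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def run_episode(rng):
    state, trajectory = 0, []
    for _ in range(MAX_STEPS):
        probs = policy(state)
        action = rng.choice(N_ACTIONS, p=probs)  # 0 = step left, 1 = step right
        trajectory.append((state, action, probs))
        state = max(0, min(GOAL, state + (1 if action == 1 else -1)))
        if state == GOAL:
            break
    reward = 1.0 if state == GOAL else 0.0       # feedback only after all decisions
    return trajectory, reward

rng = np.random.default_rng(0)
for _ in range(2000):
    trajectory, reward = run_episode(rng)
    for state, action, probs in trajectory:      # REINFORCE: theta += lr * R * grad(log pi)
        grad = -probs.copy()
        grad[action] += 1.0
        theta[state] += 0.1 * reward * grad

print("P(step right) per state:",
      np.round([policy(s)[1] for s in range(N_STATES)], 2))
```

Unlike the cat/dog classifier, nothing tells the agent which individual move was right; the policy improves purely from the end-of-episode feedback, and many different action sequences can earn the same reward.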
Industry Breakthrough in Multimodal Generalized Reasoning: OPPO Research Institute and HKUST (Guangzhou) Propose OThink-MR1
量子位· 2025-03-30 02:37
Core Viewpoint
- The article discusses the introduction of a new technology called OThink-MR1, developed by researchers from OPPO Research Institute and the Hong Kong University of Science and Technology (Guangzhou), which enhances multimodal language models' generalized reasoning capabilities through dynamic reinforcement learning [1][2][29]

Group 1: Technology Overview
- OThink-MR1 extends reinforcement learning to multimodal language models, enabling them to better handle complex tasks and new scenarios [1][2]
- The technology addresses the limitations of existing multimodal models that primarily rely on supervised fine-tuning (SFT), which hinders the development of general reasoning abilities [4][5]
- OThink-MR1 employs two core components: a dynamic KL divergence strategy (GRPO-D) and a carefully designed reward model, significantly improving learning efficiency and reasoning capabilities [8]

Group 2: Dynamic KL Divergence Strategy
- The dynamic KL divergence strategy balances exploration of new strategies and utilization of existing experience, adapting as training progresses [10][11]
- This approach prevents the model from getting stuck in local optima by encouraging exploration in the early stages and gradually shifting towards leveraging accumulated knowledge [12]

Group 3: Reward Model
- The reward model in OThink-MR1 provides two types of rewards: a verification-accuracy reward and a format reward, guiding the model's learning process [13][14]
- These rewards help the model understand its strengths and areas for improvement, promoting targeted learning [15] (a sketch of these two ingredients follows this summary)

Group 4: Experimental Validation
- The first experiment demonstrated that incorporating format rewards significantly improved model performance on geometric reasoning tasks, highlighting the importance of both content and format in evaluation [17]
- The second experiment tested cross-task evaluation, showing that the GRPO-D-trained model excelled on diverse tasks, unlike models trained with SFT [21][23]
- The third experiment revealed that OThink-MR1's GRPO-D outperformed traditional SFT methods in same-task evaluations, indicating its effectiveness in enhancing model capabilities [28]

Group 5: Future Implications
- OThink-MR1 represents a significant advancement in the development of multimodal language models, showcasing the potential of dynamic reinforcement learning to enhance reasoning and generalization abilities [29]
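The two components highlighted in Groups 2 and 3 can be sketched compactly. Everything below is a hedged reading of the summary rather than the paper's code: the answer template, the reward weights, the linear KL ramp, and the function names are assumptions, and OThink-MR1's actual GRPO-D schedule and reward definitions may differ.

```python
# Sketch of (1) a reward combining answer-verification accuracy with a format
# reward and (2) a KL weight that changes over training, plus the GRPO-style
# within-group advantage. Details are assumptions, not the paper's exact design.
import re
import numpy as np

def format_reward(output: str) -> float:
    """1.0 if the response follows the required template, e.g. '<answer>...</answer>'."""
    return 1.0 if re.search(r"<answer>.*</answer>", output, re.S) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the verifiable ground truth."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.S)
    return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(output: str, ground_truth: str, w_fmt: float = 0.5) -> float:
    return accuracy_reward(output, ground_truth) + w_fmt * format_reward(output)

def dynamic_kl_weight(step: int, total_steps: int, lo: float = 0.01, hi: float = 0.1) -> float:
    """Small KL pull toward the reference policy early (more exploration), a
    stronger pull later (exploit accumulated knowledge). A simple linear ramp."""
    return lo + (hi - lo) * step / max(1, total_steps)

def group_advantages(rewards):
    """GRPO-style advantage: standardize rewards within one group of sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: score a group of 4 sampled responses to one geometry question.
samples = ["<answer>42</answer>", "42", "<answer>41</answer>", "<answer>42</answer>"]
rewards = [total_reward(s, "42") for s in samples]
print(rewards)                                  # [1.5, 0.0, 0.5, 1.5]
print(group_advantages(rewards))                # standardized within the group
print(dynamic_kl_weight(0, 1000), dynamic_kl_weight(1000, 1000))  # 0.01 0.1
```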
Three Things Last Night: Is China's AI Tech Narrative Getting Stronger?
华尔街见闻· 2025-03-06 11:11
From last night through today, three major pieces of news hit the AI world, and the narrative around Chinese tech keeps getting stronger.

Alibaba's Tongyi team kept its word: it said it would open-source another new RL-trained model this week, and it dropped last night. The most impressive part is that a 32B model matches the full-size DeepSeek R1: on AIME24, which tests mathematical ability, and on LiveCodeBench, which evaluates coding ability, Qwen's QwQ-32B performs on par with DeepSeek-R1 and far outperforms o1-mini and the R1 distilled models of the same size. It can already be tried in the Tongyi app and on the web.

The RL training also does not appear to have taken very long. Friends at Alibaba noted that, unlike the conventional reward models used before, this run provided feedback on math problems by verifying the correctness of the generated answers.

[Image: Alibaba Tongyi open-sources a new RL model. Screenshot of a post by Junyang Lin (@JustinLin610) of the Qwen team: "This week we release QwQ-Max-Preview on Qwen Chat. I know you guys may think what happened to the opensource of this team. Here is a straight answer to you all: we will opensource the m ..."]
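The feedback-by-verification idea mentioned above can be illustrated with a tiny sketch (my own, not Alibaba's training code; the `\boxed{}` convention and the `Fraction`-based comparison are assumptions): a verifiable reward for math answers can be as simple as a function that parses the model's final answer and checks it against the known result.

```python
# Verification-based reward: score a math answer by programmatically checking
# its correctness instead of asking a learned reward model to judge it.
import re
from fractions import Fraction

def extract_answer(text: str) -> str | None:
    """Pull the final answer out of a '\\boxed{...}' span, a common convention."""
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    return m.group(1).strip() if m else None

def verify_reward(model_output: str, ground_truth: str) -> float:
    """1.0 if the boxed answer is numerically equal to the ground truth, else 0.0."""
    ans = extract_answer(model_output)
    if ans is None:
        return 0.0
    try:
        return 1.0 if Fraction(ans) == Fraction(ground_truth) else 0.0
    except (ValueError, ZeroDivisionError):
        return 1.0 if ans == ground_truth else 0.0

print(verify_reward(r"... so the result is \boxed{3/4}.", "0.75"))  # 1.0
print(verify_reward(r"... so the result is \boxed{0.8}.", "0.75"))  # 0.0
```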