Reinforcement Learning
Four Turing Award Laureates at the Helm: The 2025 BAAI Conference Reveals New Paths for AI Evolution
机器之心· 2025-05-23 04:17
In 2006, Professor Geoffrey Hinton of the University of Toronto and colleagues proposed layer-wise pretraining, breaking through the technical bottleneck of training deep neural networks and laying the groundwork for the revival of deep learning. This early summer, four Turing Award laureates ...

Reinforcement learning, a paradigm in which an agent learns by interacting with its environment, has roots that predate the rise of deep learning. DQN, proposed by DeepMind in 2013, was an early combination of deep learning and reinforcement learning, and AlphaGo's success in 2016 pushed that fusion into public view, sharply raising the profile of this intersecting field.

June 6-7, 2025, Beijing, China: join global innovators at the BAAI Conference. Register now to explore the boundless frontiers of the AI era.

Foundational theory: in the history of AI, connectionism (represented by neural networks) and behaviorism (represented by reinforcement learning) arose from different theoretical lineages, yet their technical intersection showed early signs. The two threads, long growing independently, are now interweaving into one, together forming the cornerstone of the next generation of general artificial intelligence.

On June 6, the discussion of deep learning and reinforcement learning will continue at the 2025 BAAI Conference, a dialogue across time like "twin stars converging," reviewing the past and jointly probing the ultimate answers to the riddle of intelligence. Meanwhile, the rise of reasoning models, the acceleration of the open-source ecosystem, and the flourishing of embodied intelligence have become 2025's ...
Understand Lilian Weng's 10,000-Word Essay in 5 Minutes: How Do Large Models Think?
Hu Xiu· 2025-05-22 09:54
Core Insights
- The article reviews the latest paradigms in AI, particularly "test-time compute" and how large language models (LLMs) can enhance their reasoning capabilities through various methods [3][12][26].

Group 1: AI Paradigms
- The blog systematically organizes the latest paradigms in AI, emphasizing "test-time compute" [3].
- LLMs exhibit similarities to human thought processes, drawing parallels with Daniel Kahneman's "Thinking, Fast and Slow" [4][5].
- The reasoning process in LLMs can be likened to human cognition, where "System 1" represents quick, intuitive responses and "System 2" denotes slower, analytical thinking [6][7].

Group 2: Enhancing Reasoning in LLMs
- "Chain of Thought" (CoT) lets models allocate variable computational resources based on problem complexity, which is particularly beneficial for complex reasoning tasks [9].
- Reinforcement learning (RL) has been scaled up for reasoning, with significant changes initiated by OpenAI's developments [14].
- The training of models like DeepSeek R1 combines parallel sampling and sequential improvement, enhancing the reasoning capabilities of LLMs [15][16].

Group 3: External Tool Utilization
- Using external tools during reasoning can improve efficiency and accuracy, for example by employing code interpreters for complex calculations [19].
- OpenAI's recent models, o3 and o4-mini, emphasize tool usage, marking a paradigm shift in AI development [20][21].

Group 4: Future Research Directions
- The article raises open questions, such as improving RNNs to dynamically adjust computation depth and enhancing Transformer architectures for better reasoning [28].
- It also discusses the challenge of training models to generate human-readable CoTs that faithfully reflect their reasoning while avoiding reward hacking [29][30].
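The parallel-sampling and sequential-improvement strategies summarized above can be sketched in a few lines. This is a minimal illustration, not any lab's actual pipeline: `generate` and `score` are hypothetical stand-ins for an LLM sampler and a verifier/reward model.

```python
import random

random.seed(0)

def generate(prompt: str) -> str:
    """Stand-in for an LLM sampling one candidate answer (hypothetical)."""
    return f"{prompt} -> answer {random.randint(0, 9)}"

def score(candidate: str) -> float:
    """Stand-in for a verifier or reward model (hypothetical): higher is better."""
    return float(candidate.split()[-1])

def best_of_n(prompt: str, n: int = 8) -> str:
    """Parallel sampling: draw n independent candidates, keep the best-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

def sequential_revision(prompt: str, steps: int = 3) -> str:
    """Sequential revision: keep one candidate, accept a revision only if it scores higher."""
    best = generate(prompt)
    for _ in range(steps):
        revised = generate(prompt)  # a real reviser would condition on `best`
        if score(revised) > score(best):
            best = revised
    return best
```

The trade-off noted in the summary shows up directly: `best_of_n` wastes samples when the first draw is already good, while `sequential_revision` can only improve monotonically but risks stalling if revisions never beat the incumbent.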
Lilian Weng's Latest 10,000-Word Essay: Why We Think
量子位· 2025-05-18 05:20
Core Insights
- The article presents "test-time compute" and "chain of thought" (CoT) as methods that significantly enhance model performance [1][2][6].

Group 1: Motivation and Theoretical Background
- Allowing models to think longer before answering, achievable through several methods, makes them more capable and helps overcome current limitations [2][8].
- The core idea parallels human thinking: humans need time to analyze complex problems, in line with Daniel Kahneman's dual-system theory from "Thinking, Fast and Slow" [10][11].
- By deliberately slowing down and reflecting, models can make more rational decisions, akin to human System 2 thinking [11][12].

Group 2: Computational Resources and Model Architecture
- Deep learning treats a neural network as a bundle of computational and storage resources whose use is optimized by gradient descent [13].
- In Transformer models, the computation (FLOPs) per generated token is roughly twice the parameter count; sparse models such as Mixture-of-Experts (MoE) activate only a fraction of the parameters on each forward pass [13].
- CoT lets a model spend more computation per token according to problem difficulty, enabling a variable computational budget [13][18].

Group 3: CoT and Learning Techniques
- Early CoT improvements generated intermediate steps for mathematical problems; later work showed that reinforcement learning can substantially strengthen CoT reasoning [19][20].
- Supervised learning on human-written reasoning paths, together with well-chosen prompts, greatly improves the mathematical ability of instruction-tuned models [21][23].
- The success-rate gains from CoT prompts on mathematical problems are more pronounced in larger models [23].

Group 4: Sampling and Revision Techniques
- The fundamental goal of test-time computation is to adaptively modify the model's output distribution during reasoning [24].
- Parallel sampling is straightforward but limited by the model's ability to produce a correct solution in one pass, while sequential revision must be executed carefully to avoid introducing new errors [24][25].
- Combining the two yields the best results: simpler problems benefit from sequential refinement, while harder problems do best with a mix of both approaches [24][25].

Group 5: Advanced Techniques and Future Directions
- Algorithms such as Best-of-N and beam search optimize the search for high-scoring samples [29][30].
- The RATIONALYST system synthesizes rationales from large amounts of unannotated data, providing implicit and explicit guidance for generating reasoning steps [32][33].
- Future challenges include improving computational efficiency, integrating self-correction mechanisms, and ensuring the reliability of reasoning outputs [47][50].
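The beam search mentioned above, applied to reasoning chains, keeps only the top-scoring partial chains at each depth. A minimal sketch, assuming hypothetical `expand` (a step proposer) and `score` (e.g. a process reward model) callables:

```python
from typing import Callable

def beam_search(
    initial: str,
    expand: Callable[[str], list[str]],  # proposes next reasoning steps (hypothetical)
    score: Callable[[str], float],       # scores a partial chain (hypothetical)
    beam_width: int = 3,
    depth: int = 4,
) -> str:
    """Keep only the beam_width highest-scoring partial chains at each depth."""
    beams = [initial]
    for _ in range(depth):
        # Expand every surviving chain by one step, then prune to the beam.
        candidates = [c for b in beams for c in expand(b)]
        if not candidates:
            break
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beams, key=score)
```

With `beam_width=1` this degenerates to greedy step selection; widening the beam trades compute for a lower chance of pruning the chain that would have scored best in the end.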
Just In: Peking University Alumna Lilian Weng's Latest Blog Post, "Why We Think"
机器之心· 2025-05-18 04:25
Core Insights
- The article surveys advances in using "thinking time" during model inference to enhance the reasoning capabilities of AI models such as GPT, Claude, and Gemini [2][3][16].

Group 1: Thinking Mechanisms
- "Thinking time" is analogous to human cognition: complex problems require reflection and analysis before a solution emerges [6].
- Daniel Kahneman's dual-process theory divides human thinking into fast (System 1) and slow (System 2) modes, underscoring the value of slower, deliberate thought for accurate decisions [12].

Group 2: Computational Resources
- In deep learning, a neural network can be characterized by the computational and storage resources it uses on each forward pass, which bounds its performance [8].
- Models can be made more capable by allowing extra computation at inference time, notably through Chain of Thought (CoT) prompting [8][18].

Group 3: Chain of Thought (CoT) and Learning Strategies
- CoT prompting significantly raises success rates on mathematical problems, and larger models benefit more from extended "thinking time" [16].
- Early research focused on supervised learning from human-written reasoning paths, later evolving into reinforcement learning strategies that improve CoT reasoning [14][41].

Group 4: Test-Time Computation Strategies
- The two main strategies for improving generation quality are parallel sampling and sequential revision, each with distinct advantages and drawbacks [19][20].
- Parallel sampling is simple but relies on the model producing a correct answer in one pass, while sequential revision allows targeted corrections at the cost of speed [20][21].

Group 5: Reinforcement Learning Applications
- Recent studies have used reinforcement learning to strengthen reasoning in language models, particularly on STEM tasks [41][46].
- Training often involves a cold-start phase followed by reasoning-oriented reinforcement learning, optimizing performance through structured feedback [42][43].

Group 6: External Tools and Integration
- External tools such as code interpreters or APIs can enhance reasoning by offloading parts of the computation [52][56].
- The ReAct method interleaves external actions with the reasoning trajectory, letting a model fold external knowledge into its inference path [56][57].

Group 7: Model Interpretability and Trustworthiness
- CoT supports interpretability by exposing intermediate reasoning that can be monitored [59].
- However, CoT outputs are not always faithful: biases and errors can undermine the reliability of the stated reasoning [62][64].

Group 8: Adaptive Computation and Token Utilization
- Adaptive computation time lets models dynamically adjust the number of computation steps at inference, improving reasoning [81].
- Special tokens, such as thinking tokens, can buy additional processing time and improve performance on complex tasks [85][89].
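The ReAct pattern mentioned above, which interleaves reasoning with external tool calls, can be sketched as a simple loop. This is an illustrative skeleton only; `llm`, `tools`, and the "Action:"/"Finish:" line format are assumptions, not ReAct's exact prompt protocol:

```python
def react_loop(question, llm, tools, max_steps=5):
    """Minimal ReAct-style loop (a sketch; `llm` and `tools` are hypothetical stand-ins).

    An "Action: name[arg]" line from the model triggers a tool call whose
    result is appended to the trajectory as an observation; a "Finish: ..."
    line ends the loop with a final answer.
    """
    trajectory = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(trajectory)
        trajectory += step + "\n"
        if step.startswith("Finish:"):
            return step.removeprefix("Finish:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition("[")
            observation = tools[name](arg.rstrip("]"))
            trajectory += f"Observation: {observation}\n"
    return None  # no final answer within the step budget
```

The key design point the summary highlights is visible here: each tool observation is written back into the trajectory, so subsequent reasoning steps condition on external knowledge the model does not contain.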
New Research from Tongyi Lab: Large Models "Play" the Search Engine Themselves, Boosting Reasoning Without a Search API
量子位· 2025-05-17 03:50
Core Insights
- The article introduces ZeroSearch, an open-source reinforcement learning framework from Alibaba's Tongyi Lab that improves the search capabilities of large language models (LLMs) without relying on a real search engine [4][19].

Group 1: Challenges in Current Approaches
- Real search engines return documents of unpredictable quality, injecting noise and instability into training [2].
- RL training requires frequent queries, and the resulting API costs limit scalability [3].

Group 2: The ZeroSearch Solution
- ZeroSearch removes the need to interact with real search engines, avoiding API costs and making large-scale RL training economically feasible [19][36].
- Through a simulated search environment and progressive noise-resistant training, the LLM becomes self-sufficient in evolving its search skills [6][19].

Group 3: Training Methodology
- Lightweight fine-tuning turns an LLM into a "search engine simulator" that can generate both useful results and noise with minimal labeled data [7][10].
- A curriculum-based noise schedule first returns high-quality documents and gradually mixes in noise, improving training stability and effectiveness [12][14].

Group 4: Performance Metrics
- Experiments show that a 3-billion-parameter LLM simulator is enough to significantly improve search capability while saving API costs [5].
- ZeroSearch outperforms existing methods on both single-hop and multi-hop question answering, demonstrating superior retrieval capability [25][26].

Group 5: Compatibility with RL Algorithms
- ZeroSearch works with multiple RL algorithms, including Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), offering flexibility in training strategy [19][20].
- GRPO shows better training stability, while PPO offers more flexibility on certain tasks, indicating that ZeroSearch can adapt to different algorithmic needs [21][34].

Group 6: Future Implications
- By addressing the cost and stability problems of current methods, ZeroSearch paves the way for future advances in intelligent retrieval systems [37].
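The curriculum-based noise schedule described above can be illustrated with a toy simulated retriever. The function names and the linear ramp are illustrative assumptions, not ZeroSearch's actual implementation; in the real framework a fine-tuned LLM generates the documents:

```python
import random

def noise_probability(step: int, total_steps: int, p_max: float = 0.8) -> float:
    """Curriculum schedule: start with clean documents, ramp noise linearly to p_max."""
    return p_max * min(step / total_steps, 1.0)

def simulated_search(query: str, step: int, total_steps: int, k: int = 5) -> list[str]:
    """Toy simulated retriever: tagged strings stand in for the useful vs. noisy
    documents a fine-tuned LLM simulator would generate."""
    p = noise_probability(step, total_steps)
    docs = []
    for i in range(k):
        if random.random() < p:
            docs.append(f"[noise] irrelevant document {i}")
        else:
            docs.append(f"[useful] evidence for '{query}' ({i})")
    return docs
```

Early in training the policy sees only clean evidence and learns the basic retrieve-then-reason loop; as the noise fraction rises it must learn to ignore distractor documents, which is the "progressive noise-resistant training" the summary refers to.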
OpenAI: GPT-5 Is All in One, Integrating Its Various Products
量子位· 2025-05-17 03:50
Core Viewpoint
- OpenAI is integrating its various products, including Codex, Operator, Deep Research, and Memory, into a unified system to improve programming efficiency and reduce model switching [2][11].

Group 1: Codex Development and Efficiency
- Codex began as an internal side project to improve workflows; used well, it raises programming efficiency roughly threefold [5][17].
- OpenAI is exploring flexible pricing for Codex, including pay-per-use options [5].
- The team aims to build a high-performance engine supporting multiple programming languages so developers can write extensions in their preferred language [8].

Group 2: Future Plans and Integration
- The plan is to consolidate existing tools into a cohesive system that feels integrated, improving user experience [11].
- Operator, currently in research preview, aims to execute tasks on computers, further extending GPT-5's capabilities [10].

Group 3: User Interaction and Learning
- Codex is designed to serve not only advanced engineers but also users solving simpler problems, broadening its audience [13].
- The model currently uses information loaded at container runtime, such as GitHub repositories, and does not access real-time library documentation [15].
- OpenAI is considering retrieval-augmented generation (RAG) to give the model access to up-to-date knowledge [15].

Group 4: Long-Term Vision and Impact
- The team envisions software requirements being transformed efficiently and reliably into runnable software [18].
- Codex is meant to augment human developers rather than replace them, particularly by helping novice programmers learn [19].

Group 5: Additional Resources
- OpenAI has released a "Codex Getting Started Guide" covering basics, GitHub connections, task submission, and prompting tips [24][25].
OpenAI Chief Scientist Pachocki: AI Is Beginning to Show Original Research Capability
36Kr· 2025-05-16 10:14
Core Insights
- OpenAI chief scientist Jakub Pachocki argues that reinforcement learning is pushing AI models toward the frontier of reasoning, signaling that AGI is moving from theory to reality [1][5].
- The tension between open source and safety is a central challenge in current AI development [1].
- Future AI is expected to conduct original scientific research independently, advancing disciplines such as software engineering and hardware design [1][2].

Group 1: AI Development and Capabilities
- Today's models can hold a dialogue but need ongoing guidance; significant improvements in AI as an assistant are expected within the next five years [2].
- Tools like OpenAI's Deep Research can run unsupervised for 10 to 20 minutes, producing valuable output with modest computational resources [2].
- AI is expected to reach genuine original-research capability, driving major advances in fields such as automated software engineering and autonomous hardware design [2][3].

Group 2: Reinforcement Learning and Reasoning
- Recent gains in reasoning models stem largely from improvements in the reinforcement learning phase, which strengthens practical outcomes and autonomy [3].
- Reasoning capability evolves gradually on top of pre-trained models, with ongoing research into combining methods and understanding their interactions [3].

Group 3: Open Source and Safety
- OpenAI plans to release its first open-source model since GPT-2, stressing the need to understand the societal impact of open-source deployment as model capabilities advance rapidly [4].
- Releasing frontier models with open weights faces serious safety challenges; the goal is to outperform existing open-source models while remaining safe [4].

Group 4: General Artificial Intelligence (AGI)
- The definition of and timeline for AGI have shifted significantly; early skepticism has given way under rapid technological progress [5][6].
- Major AGI milestones are expected in measurable economic impact and original research, with significant breakthroughs anticipated by the end of the decade [6].
- AI is projected to autonomously build valuable software soon, though solving major scientific problems may remain distant [6].
Generalization Up 47%: The First Reward Paradigm for Intent Detection, a New Approach for the Era of Exploding AI Tools
机器之心· 2025-05-16 04:39
Core Viewpoint
- Rapid progress in large language models (LLMs) and the explosion of integrable tools have made AI assistants far more convenient in daily life, but intent detection and generalization remain critical challenges [1][2].

Group 1: Research and Methodology
- Tencent's PCG social-line research team applied reinforcement learning (RL), specifically the Group Relative Policy Optimization (GRPO) algorithm combined with Reward-based Curriculum Sampling (RCS), to intent detection [2].
- Models trained with RL generalize significantly better than those trained with supervised fine-tuning (SFT), especially on unseen intents and cross-lingual tasks [4].
- Introducing an explicit thought process during RL training further improves generalization on complex intent detection tasks [5].

Group 2: Experimental Results
- GRPO outperformed SFT in generalization across datasets, including MultiWOZ 2.2 and a self-built Chinese dataset, TODAssistant [17].
- GRPO matched SFT's in-domain performance on MultiWOZ 2.2, confirming its effectiveness for intent detection [14].
- Combining GRPO with RCS further improved accuracy, especially in the second phase of curriculum learning [19].

Group 3: Future Directions
- The team plans to explore more efficient online data filtering for RCS [24].
- Multi-intent recognition is a next step, as current experiments focus on single-intent scenarios [25].
- The team aims to extend the work to more complex task-oriented dialogue tasks beyond intent recognition [26].
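The group-relative scoring at the heart of GRPO can be sketched in a few lines: rewards for a group of responses sampled from the same prompt are standardized within the group, which replaces the learned value (critic) network that PPO needs. This is a minimal sketch of the advantage computation only, not the full GRPO objective with its clipping and KL terms:

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize each response's reward against its group's mean and std.

    In GRPO, a group of responses is sampled per prompt; each response's
    advantage is its reward relative to the group, so above-average
    responses are reinforced and below-average ones are penalized.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With a binary intent-match reward, for example, `grpo_advantages([1.0, 0.0, 1.0, 0.0])` gives positive advantages to the two correct responses and negative ones to the two incorrect responses, summing to roughly zero across the group.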
Robotics Series Report No. 27: Controllers Provide the Embodied-Intelligence Base; the Data Flywheel Drives Model Iteration
Investment Rating
- The report maintains a positive outlook on the humanoid robot industry, emphasizing that software development is key to commercialization [3][4].

Core Insights
- Hardware maturity in humanoid robots currently exceeds that of software, and software is the key to commercialization; advances in algorithms, data, and control systems are needed to move the industry forward [3][5][6].

Summary by Sections
1. Algorithms: The Core of Embodied Intelligence
- The algorithm stack splits into two levels: an upper "brain" for task-level planning and decision-making, and a lower "cerebellum" for real-time motion planning and joint control [3][11][18].
- Control algorithms are shifting from traditional methods toward modern approaches like reinforcement learning (RL) and imitation learning (IL) [3][19][29].
- The VLA (Vision-Language-Action) model is a significant advance in upper-level control, enabling robots to understand and execute tasks via natural language [3][36][40].

2. Data: The Foundation of Algorithm Learning
- Data quality and diversity are crucial to algorithm performance; sources fall into real, synthetic, and web data, with real data the most accurate but least abundant [3][74][76].
- Teleoperation and motion-capture technologies are important for collecting high-quality real data [3][79].

3. Control Systems: The Foundation of Embodied Intelligence
- The control system is the robot's "brain," consisting of hardware (SoC chips, CPUs, GPUs, NPUs) and software components [3].
- The industry lacks consensus on how the "brain" and "cerebellum" should be structured, though both are essential for executing complex algorithms and tasks [3].

4. Investment Opportunities
- Key companies to watch in the humanoid robot industry include:
  - Controllers: Tianzhun Technology, Zhiwei Intelligent, Desay SV [4].
  - Motion control: Huichuan Technology, Xinjie Electric, Leisai Intelligent, Gokong Technology, Tosida [4].
  - Chips: Rockchip, Horizon Robotics [4].
  - Data collection equipment: Lingyun Optical, Aofei Entertainment [4].
JinQiu Capital's Zang Tianyu: 2025 AI Venture Investment Trends
锦秋集· 2025-05-14 10:02
Core Insights
- The article reviews investment trends in the AI sector, highlighting a shift from foundational models to the application layer as the core focus for investment opportunities [1][7][11].

Group 1: Domestic AI Investment Trends
- JinQiu Capital's portfolio serves as a small sample window onto domestic AI investment trends [2].
- About 60% of projects concentrate in the application layer, driven by improved model intelligence and sharply reduced invocation costs [6][7].
- As foundational models, particularly large language models (LLMs), have matured, the investment focus has shifted from them toward application-oriented projects [6][7].

Group 2: Key Investment Areas
- The application layer dominates, with nearly 40% of investments in Agent AI, 20% in creative tools, and another 20% in content and emotional consumption [8].
- Bottom-layer computing power and Physical AI are also critical, with investments aimed at improving model training and inference [9][10].
- Middle-layer/toolchain investment is limited, focusing on large-model security and reinforcement learning infrastructure [10].

Group 3: Trends in AI Intelligence and Cost
- The twin drivers of investment decisions are the continuous improvement of AI intelligence and the falling cost of acquiring it [12][13].
- The industry's focus has shifted from pre-training scaling laws to optimizing the post-training phase, giving rise to "test-time scaling" [14][15].
- The "Agent AI" era is marked by agents built to solve practical operational problems [15].

Group 4: Cost Reduction in AI
- Token costs have fallen dramatically, to as low as 0.8 RMB per million tokens, making applications economically viable [19][20].
- Reasoning models remain costly because they consume more tokens, so further innovation is needed to cut inference costs [21][22].
- Innovations in underlying computing architectures, such as processing-in-memory and optical computing, are expected to drive long-term cost reductions [23][24].

Group 5: Opportunities in the Application Layer
- Better intelligence plus lower cost has triggered a surge of entrepreneurial activity in the application layer [26].
- The AI era introduces new variables: richer information and services, and precise recommendation evolving into proactive service [29][30].
- The marginal cost of content creation and service delivery has dropped sharply, enabling scalable, distributable service models [31][33].

Group 6: Future of Physical AI
- Achieving general-purpose robots is highlighted as a key direction for Physical AI [37].
- Data remains the core bottleneck for general-purpose robots, requiring joint optimization of hardware and software [40].