Reinforcement Learning
LatePost Exclusive | AI Agent startup Pokee.ai raises $12 million seed round; investors include Point72 Ventures and Intel's Lip-Bu Tan
晚点LatePost· 2025-07-09 11:38
Core Viewpoint
- Pokee.ai, an AI Agent startup, recently raised approximately $12 million in seed funding to accelerate research and sales efforts, with notable investors including Point72 Ventures and Qualcomm Ventures [5][6].

Group 1: Company Overview
- Pokee.ai was founded in October 2022 and currently has only 7 employees. The founder, Zhu Zheqing, previously led the "Applied Reinforcement Learning" department at Meta, where he significantly improved the content recommendation system [7].
- Unlike other startups that use large language models (LLMs) as the "brain" of their agents, Pokee relies on a different reinforcement learning model that does not require extensive context input [7].

Group 2: Technology and Cost Efficiency
- The current version of Pokee has been trained on 15,000 tools, allowing it to adapt to new tools without needing additional context [8].
- Using reinforcement learning models is more cost-effective than using LLMs, which can incur costs of several dollars per task due to high computational demands; Pokee's task completion cost is only about 1/10 of its competitors' [8].

Group 3: Market Strategy and Product Development
- Pokee aims to optimize its ability to call data interfaces (APIs) across various platforms, targeting large companies and professional consumers to facilitate cross-platform tasks [9].
- The funding will also support the integration of new features, including a memory function to better understand client needs and preferences [9].

Group 4: Seed Funding Trends
- The seed funding landscape for AI startups is evolving, with average seed round sizes increasing significantly. In 2020, the median seed round was around $1.7 million, rising to approximately $3 million in 2023 [10].
- The high costs of AI product development necessitate larger funding rounds to sustain operations, with some companies reportedly burning through $100 million to $150 million annually [13][14].

Group 5: Investment Climate
- Investors are becoming more cautious, requiring solid product-market fit (PMF) before committing to funding. The median time between seed and Series A funding has increased to 25 months, the highest in a decade [17][18].
How do you teach an AI to reflect on its mistakes?
Hu Xiu· 2025-07-09 07:57
Core Insights
- The article discusses a research paper titled "Reflect, Retry, Reward: Self-Improvement of Large Language Models through Reinforcement Learning," which presents a novel approach for AI to learn from its mistakes [5][6][10].

Group 1: Research Overview
- The research team, eight authors from an AI startup called Writer, published the paper, which ranked third on the Hugging Face platform's June paper leaderboard [3][4].
- The paper emphasizes a three-step process for AI to learn from errors: Reflect, Retry, and Reward [5][10].

Group 2: Learning Mechanism
- The first step, Reflect, has the AI generate a self-reflection on its mistakes after failing a task, much as students analyze their errors [11].
- The second step, Retry, allows the AI to attempt the same task again, armed with insights from its reflection [12].
- The third step, Reward, applies reinforcement learning to adjust the model's parameters based on the effectiveness of its reflection, rather than on the final answer alone [13][14] (a minimal code sketch of this loop follows the summary).

Group 3: Experimental Validation
- The research team conducted two experiments, one on function calling and one on solving mathematical equations; both are challenging tasks with clear success criteria [16][18].
- In the function-calling task, a model with 1.5 billion parameters improved its first-attempt accuracy from approximately 32.6% to 48.6% after implementing the reflection mechanism, and to 52.9% after a retry [20][21].
- For the equation-solving task, the same model's accuracy increased from 6% to 34.9% on the first attempt, and to 45% after a retry, a significant improvement [23][24][25].

Group 4: Implications for AI Development
- The findings suggest that smaller models can outperform larger ones when trained with effective learning strategies, indicating that model size is not the only determinant of performance [26][29].
- The research highlights the potential of optimized training methods to enhance the capabilities of smaller models, which can lead to cost savings in AI development [29].
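To make the three-step mechanism concrete, below is a minimal sketch of one Reflect-Retry-Reward episode. It assumes generic `generate` (model call) and `check_answer` (task verifier) callables; the prompts, names, and the stubbed-out RL update are illustrative, not taken from the paper itself.

```python
# Minimal sketch of one Reflect-Retry-Reward episode, assuming generic
# `generate` (model call) and `check_answer` (task verifier) callables.
# The actual RL update (e.g. GRPO) is only represented by the returned reward.

def reflect_retry_episode(generate, check_answer, task_prompt):
    # Attempt 1: plain generation.
    first_answer = generate(task_prompt)
    if check_answer(first_answer):
        return first_answer, None  # success on the first try; nothing to reinforce

    # Reflect: the model analyses its own failure.
    reflection = generate(
        f"{task_prompt}\n\nYour previous answer was:\n{first_answer}\n"
        "It was incorrect. Briefly explain what went wrong and how to fix it."
    )

    # Retry: the same task, now conditioned on the self-reflection.
    second_answer = generate(
        f"{task_prompt}\n\nSelf-reflection:\n{reflection}\n\nTry the task again."
    )

    # Reward: the reflection is reinforced only when the retry succeeds,
    # so the model learns to write reflections that actually help.
    reward = 1.0 if check_answer(second_answer) else 0.0
    return second_answer, {"reflection": reflection, "reward": reward}
```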
A super plug-in for DeepSeek-R1! "Humanity's Last Exam" breaks 30 points for the first time, as an open-source approach from Shanghai Jiao Tong University and others outperforms OpenAI and Google
量子位· 2025-07-09 04:57
Core Insights
- The article highlights a significant achievement by a domestic team from Shanghai Jiao Tong University and DeepMind Technology, which scored 32.1 points on "Humanity's Last Exam" (HLE), setting a new record on a notoriously difficult AI test [1][2][26].

Group 1: Achievement and Context
- The previous highest score on the HLE was 26.9, achieved by Kimi-Researcher and Gemini Deep Research [2].
- The HLE was launched earlier this year and is known for its extreme difficulty; initially no model scored above 10 points [34][39].
- The test includes over 3,000 questions across various disciplines, with a significant focus on mathematics [39].

Group 2: Methodology and Tools
- The team developed two key systems: the tool-enhanced reasoning agent X-Master and the multi-agent workflow system X-Masters [3][20].
- X-Master works by simulating the dynamic problem-solving process of human researchers, switching seamlessly between internal reasoning and external tool use [9][10].
- The core mechanism treats code as an interactive language, enabling the agent to generate and execute code when it faces a problem it cannot solve by reasoning alone [11][14] (see the sketch after this summary).

Group 3: Performance Metrics
- The X-Masters system achieved a record score of 32.1%, surpassing all existing agents and models [26].
- The improvement came from several workflow components: tool-enhanced reasoning raised baseline accuracy by 3.4%, iterative optimization added 9.5%, and final selection produced the record score [29][30].
- In specific categories, X-Masters outperformed existing systems, reaching 27.6% accuracy in biology/medicine, compared with 17.3% for Biomni and 26% for STELLA [31].

Group 4: Future Implications
- X-Masters is designed to broaden and deepen reasoning through a decentralized-stacked approach in which multiple agents collaborate to generate and refine solutions [20][22].
- This structured exploration-and-exploitation strategy is likened to concepts in reinforcement learning, pointing to further advances in AI reasoning capabilities [23].
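As a rough illustration of treating "code as an interactive language," the sketch below alternates model reasoning with execution of any fenced Python block the model emits, feeding the captured output back as an observation. The `llm_generate` callable, the regex, and the loop structure are assumptions made for illustration, not the released X-Master implementation.

```python
import contextlib
import io
import re

# The model is assumed to wrap tool code in fenced python blocks; the fence
# string is built here rather than written literally.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def solve_with_tools(llm_generate, question, max_turns=5):
    transcript = question
    for _ in range(max_turns):
        response = llm_generate(transcript)
        transcript += "\n" + response
        match = CODE_BLOCK.search(response)
        if match is None:                 # a pure reasoning turn ends the loop
            return response
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):   # run the generated code
            try:
                exec(match.group(1), {})           # illustration only
            except Exception as err:               # errors are observations too
                print(f"Execution error: {err}")
        transcript += f"\nObservation:\n{buffer.getvalue()}"
    return transcript
```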
A 4B small model surpasses Claude 4 in mathematical reasoning for the first time, with 700 steps of RL training approaching 235B-model performance | HKU & ByteDance Seed & Fudan
量子位· 2025-07-09 01:18
Core Viewpoint
- The Polaris model, developed in collaboration between the University of Hong Kong's NLP team, ByteDance Seed, and Fudan University, demonstrates mathematical reasoning superior to leading commercial models, scoring 79.4 on AIME25 and 81.2 on AIME24 [1][53].

Group 1: Model Performance and Training
- Polaris uses Scaling Reinforcement Learning (RL) to enhance the mathematical reasoning of a 4B model, surpassing commercial models such as Seed-1.5-thinking and Claude-4-Opus [1][5].
- The lightweight Polaris-4B can be deployed on consumer-grade graphics cards [2].
- The research team confirmed that Scaling RL can reproduce significant performance gains on cutting-edge open-source models such as Qwen3 [5].

Group 2: Training Data and Methodology
- Polaris's success hinges on training data and hyperparameter settings tailored to the model being trained [7].
- The team observed a mirrored difficulty distribution in the training data, indicating that the same dataset presents different levels of challenge to models of different capability [8][10].
- A dynamic data-updating strategy was implemented so the training set adapts as the model improves, removing overly easy samples during training [13] (see the sketch after this summary).

Group 3: Sampling Diversity and Temperature Control
- Diversity in sampling is crucial for performance, allowing the model to explore a broader range of reasoning paths [14].
- The team found that commonly used temperature settings (0.6 and 1.0) were too low, limiting the model's exploration [27].
- A three-zone temperature framework was established, comprising the Robust Generation Zone, the Controlled Exploration Zone, and the Performance Collapse Zone, to guide the choice of optimal sampling temperatures [28].

Group 4: Long-Context Training and Performance
- The model's pre-training context length was limited to 32K, but RL training extended it to 52K, addressing the challenge of long-context training [37].
- Length-extrapolation techniques improved long-text generation accuracy from 26% to over 50% [41].
- A multi-stage training approach gradually increased context-window lengths to strengthen reasoning capabilities [48].

Group 5: Evaluation and Results
- Polaris achieved the highest performance in most evaluations, demonstrating its effectiveness on mathematical reasoning tasks [53].
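As a hypothetical illustration of the dynamic data-updating idea in Group 2, the sketch below re-estimates each problem's pass rate from the current policy's rollouts and drops problems the model already solves almost every time. The rollout count and threshold are placeholder values, not Polaris's actual settings.

```python
# Hypothetical sketch of dynamic training-data updating: estimate each
# problem's pass rate under the current policy and drop problems that have
# become too easy. `rollout_fn(problem)` is assumed to return True on success.

def filter_training_pool(problems, rollout_fn, n_rollouts=8, max_pass_rate=0.9):
    kept = []
    for problem in problems:
        successes = sum(rollout_fn(problem) for _ in range(n_rollouts))
        pass_rate = successes / n_rollouts
        if pass_rate <= max_pass_rate:   # keep only problems that still challenge the model
            kept.append(problem)
    return kept
```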
Embodied Intelligence Paper Express | Reinforcement learning, VLA, VLN, world models, and more
具身智能之心· 2025-07-08 12:54
Core Insights
- The article discusses advances in Vision-Language-Action (VLA) models through reinforcement learning (RL), specifically the Proximal Policy Optimization (PPO) algorithm, which significantly enhances these models' generalization capabilities [2][4].

Group 1: VLA Model Enhancements
- Applying PPO led to a 42.6% increase in task success rates in out-of-distribution (OOD) scenarios [2].
- Semantic-understanding success rates improved from 61.5% to 75.0% on unseen objects [2].
- In dynamic-interference scenarios, success rates rose from 28.6% to 74.5% [2].

Group 2: Research Contributions
- A rigorous benchmark was established to evaluate how VLA fine-tuning methods affect generalization across visual, semantic, and execution dimensions [4].
- PPO was identified as superior to other RL algorithms such as GRPO and DPO for VLA fine-tuning, with discussion of how these algorithms can be adapted to VLA's specific needs [4].
- An efficient PPO-based fine-tuning scheme was developed, using a shared actor-critic backbone network, VLA model warm-up, and a small number of PPO training iterations [4] (a sketch of the shared-backbone setup follows this summary).
- The study showed that RL's generalization in semantic understanding and execution outperformed supervised fine-tuning (SFT), while maintaining comparable visual robustness [4].

Group 3: NavMorph Model
- The NavMorph model is introduced as a self-evolving world model for vision-and-language navigation in continuous environments, achieving a 47.9% success rate in unseen environments [13][15].
- The model includes a World-aware Navigator that infers dynamic representations of the environment and a Foresight Action Planner that optimizes navigation strategies through predictive modeling [15].
- Experiments on mainstream VLN-CE benchmark datasets show that NavMorph significantly improves the performance of leading models, validating its advantages in adaptability and generalization [15].
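Below is a small, hypothetical PyTorch sketch of the shared actor-critic backbone mentioned in Group 2: a single trunk (standing in for the VLA model) feeds both a policy head and a value head, so the PPO critic adds little overhead. Dimensions, layers, and names are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

# Illustrative shared actor-critic: one backbone feeds both PPO heads.
class SharedActorCritic(nn.Module):
    def __init__(self, obs_dim=512, hidden_dim=256, action_dim=7):
        super().__init__()
        self.backbone = nn.Sequential(          # stands in for the VLA trunk
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)  # actor
        self.value_head = nn.Linear(hidden_dim, 1)            # critic

    def forward(self, obs):
        features = self.backbone(obs)
        return self.policy_head(features), self.value_head(features).squeeze(-1)

# Usage: action logits and state values come from a single forward pass.
logits, values = SharedActorCritic()(torch.randn(4, 512))
```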
Featured share! VR-Robo: real2sim2real enables robot navigation and locomotion control in real-world scenes
具身智能之心· 2025-07-08 09:31
Core Viewpoint
- The article discusses the limitations of legged robots in real-world applications due to the gap between simulation and reality, particularly in high-level tasks requiring RGB perception. It introduces a "Real-Sim-Real" framework that enhances visual navigation and motion control through a digital-twin simulation environment [2].

Group 1
- Locomotion control for legged robots benefits from the combination of reinforcement learning and physical simulation, but is hindered by the lack of realistic visual rendering [2].
- The proposed "Real-Sim-Real" framework uses multi-view images for 3D Gaussian Splatting (3DGS) scene reconstruction, creating a simulation environment that combines photo-realism with physical interaction characteristics [2].
- Experiments in the simulator show that the method supports transferring reinforcement-learning policies from simulation to reality using pure RGB input, enabling rapid adaptation and efficient exploration in new environments [2].

Group 2
- The framework shows potential for home and factory settings, indicating its relevance for practical deployment in varied environments [2].
- The paper, "VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion," is linked for further reading [3].
- Additional project details are available at the project link provided [3].
Multimodal models learn to "search on demand": 30% fewer searches and higher accuracy! New ByteDance & NTU research optimizes multimodal model search strategies
量子位· 2025-07-08 07:30
Contributed by the MMSearch-R1 team
量子位 | WeChat official account QbitAI

Multimodal models have learned to "search on demand"!

The latest research from ByteDance and NTU optimizes the search strategy of multimodal models: by building a web search tool, constructing a multimodal search dataset, and designing a simple but effective reward mechanism, it makes the first attempt at training multimodal models for autonomous search with end-to-end reinforcement learning (a sketch of such a reward appears after this article summary).

The trained model can decide on its own when to search, what to search for, and how to process the search results, carrying out multiple rounds of on-demand search in a real internet environment.

Experiments show that on knowledge-intensive visual question answering (VQA) tasks, the MMSearch-R1 system has clear advantages: it not only outperforms same-size models running a traditional retrieval-augmented generation (RAG) workflow, but also matches the performance of larger models under traditional RAG while making about 30% fewer search calls.

The rest of the article analyzes the research methodology and experimental findings in detail.

How exactly is this achieved?

In recent years, as vision-language training datasets have grown in both scale and quality, large multimodal models (LMMs) have shown excellent performance on cross-modal understanding tasks, with markedly stronger alignment between textual and visual knowledge. However, real-world information is highly dynamic and complex, and a single ...
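As a minimal sketch of the kind of "simple but effective reward mechanism" described above, the function below gives full reward for a correct answer but discounts it per search call, so the policy learns to search only when necessary. The penalty coefficient is an illustrative assumption, not the paper's reported value.

```python
# Minimal sketch of a search-penalized reward: correct answers earn reward,
# but each search call is slightly discounted to discourage needless searches.

def search_reward(is_correct: bool, num_searches: int, penalty: float = 0.1) -> float:
    if not is_correct:
        return 0.0
    return max(0.0, 1.0 - penalty * num_searches)
```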
Pushing the boundary of omni-modal AI understanding: contextual reinforcement learning lifts omni-modal models' "intent" reasoning to new heights
量子位· 2025-07-08 07:30
Core Viewpoint
- The article emphasizes the growing need for deep understanding and analysis of human intent in the context of multimodal large language models (MLLMs) and highlights the challenges of applying reinforcement learning (RL) effectively to complex multimodal data and formats [1][4].

Group 1: Challenges in Multimodal Reasoning
- Insufficient global context understanding leads to incorrect answers when models fail to identify, or misinterpret, multimodal evidence and contextual information [3].
- The shortcut problem arises when models overlook key clues and answer without fully considering the multimodal information, producing suboptimal or partial results [4].

Group 2: Innovations and Advantages
- HumanOmniV2 introduces a mandatory context summary before reasoning, ensuring the model does not skip critical multimodal input and providing comprehensive global background support [12].
- A multidimensional reward mechanism is used, including a context reward, a format reward, and an accuracy reward, to guide the model toward accurate understanding of the multimodal context [13][14] (see the sketch after this summary).
- The model is encouraged to perform complex logical reasoning by evaluating whether the reasoning process successfully integrates multimodal information and employs advanced logical analysis techniques [15].

Group 3: Model Design and Training Strategies
- The model is based on Qwen2.5-Omni-Thinker, with improvements to the Group Relative Policy Optimization (GRPO) method to enhance training efficiency, fairness, and robustness [19][20].
- Token-level loss addresses the imbalance in long-sequence training, ensuring balanced optimization for each token [19].
- Removing the question-level normalization term promotes consistent optimization across problems of different difficulty [19].
- Dynamic KL divergence is used to improve exploration and training stability throughout the training cycle [20].

Group 4: High-Quality Datasets and Benchmarks
- A comprehensive multimodal reasoning training dataset was created, covering image, video, and audio understanding tasks with rich contextual information [23].
- IntentBench, a new multimodal benchmark, evaluates models' ability to understand human behavior and intent in complex scenarios, featuring 633 videos and 2,689 related questions [23].

Group 5: Experimental Results
- HumanOmniV2 achieved breakthrough results across multiple benchmarks, reaching 58.47% on Daily-Omni, 47.1% on WorldSense, and 69.33% on the newly introduced IntentBench, outperforming existing open-source multimodal models [24].
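The sketch below shows one hypothetical way to fold the three reward terms listed in Group 2 (context, format, accuracy) into a single scalar for a GRPO-style update; the weights are illustrative assumptions, not HumanOmniV2's actual values.

```python
# Hypothetical combination of the context, format, and accuracy rewards
# into one scalar; weights are illustrative placeholders.

def combined_reward(context_score, format_ok, answer_correct,
                    w_context=0.3, w_format=0.2, w_accuracy=0.5):
    reward = w_context * context_score                         # quality of the context summary
    reward += w_format * (1.0 if format_ok else 0.0)           # output follows the required format
    reward += w_accuracy * (1.0 if answer_correct else 0.0)    # final answer is correct
    return reward
```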
A summer-night gathering for the RL community! A 12-person fireside chat: when reinforcement learning meets large-model agents
机器之心· 2025-07-08 04:09
Core Viewpoint
- The article promotes an event titled "Reinforcement Learning New Paradigm Exploration Night," emphasizing the integration of reinforcement learning (RL) with large-model agents and its significance in the current technological landscape [2][3].

Event Details
- The event is scheduled for July 26, 2025, from 19:00 to 21:10, near the Shanghai Expo Exhibition Center, and is limited to 12 participants to facilitate deep discussion [3][4].
- The event covers three main topics: the synergy between reinforcement learning and large-model agents, the dilemma of exploration versus stability in training strategies, and the challenges of aligning and evaluating intelligent agents [4].

Target Audience
- The event is designed for people from academia, industry, and entrepreneurship, encouraging participants to bring their latest research, practical experience, and product challenges for collaborative discussion [5][6].
- The focus is on lively exchanges of ideas rather than formal presentations, aiming for a dynamic and engaging atmosphere [6][7].

Participation Information
- Interested participants can scan a QR code to indicate their background (academic, industry, or entrepreneurial) and the specific RL challenges they wish to discuss; spots are limited [8].
- The article emphasizes the importance of meaningful technical discussion and debate, presenting the event as a unique opportunity for networking and collaboration [9].
Reviewing AI at home and abroad, with a note on Hang Seng Tech
小熊跑的快· 2025-07-07 09:45
Market Overview
- After April 7, both the US and Chinese stock markets rallied, with the Nasdaq rising 32.9%, the Hang Seng Tech Index ETF (513180) gaining 11.57%, and the Shanghai Composite Index gaining 12.16% [1].

AI Chip Market Dynamics
- Focus has shifted from training GPUs to AI inference ASIC chips, driven by a slowdown in the iteration of foundational models under the transformer architecture [3][5].
- Rental prices for training chips such as the H100 and H200 have declined since February, influenced by the industry's pivot toward reinforcement learning (RL) [5][6].
- The upcoming GPT-5 model is expected to emphasize RL, which demands less compute than the pre-training phase [5].

Data Source Considerations
- A significant portion of GPT-5's training data is synthetic, raising concerns about the quality and sourcing of training data for future models [6].
- Competition in the coding domain, particularly between Claude 4 and Cursor, highlights the need for models to specialize in industry-specific data to retain value [6].

Token Usage Growth
- Microsoft reported token volume exceeding 100 trillion in Q1 2025, a fivefold increase year-on-year, while Google's monthly token processing surged from 9.7 trillion to 480 trillion, roughly 50x growth [7].
- Domestic AI models such as Doubao saw daily token usage exceed 16.4 trillion in May, more than 4x growth compared with the end of 2024 [7].

ASIC Chip Outlook
- The current market environment favors the development of inference ASIC chips, as existing models are already accurate enough for applications [8][9].
- The anticipated return of ASIC chip supply in Q3 is expected to relieve the shortages of the first two quarters [9][10].
- Overall sentiment toward the Hang Seng Tech Index is cautiously optimistic, with expectations of a rebound in capital expenditure (capex) [10].

Future Projections
- The ASIC chip market is projected to grow significantly from 2025 to 2027, coinciding with the next major architectural shift in foundational models [10].
- Companies like Microsoft and Amazon are expected to continue their ASIC chip design efforts, with no immediate acknowledgment of failures in early generations [10].