Reinforcement Learning (强化学习)
Jinqiu-Backed Pokee AI Founder Zhu Zheqing: Ten Years of a Reinforcement Learning Believer | Jinqiu Spotlight
锦秋集· 2025-12-30 10:29
Core Insights
- The article traces the journey of Zhu Zheqing, founder of Pokee AI, who has pursued reinforcement learning (RL) as a path to intelligent agents capable of learning in uncertain environments, emphasizing the challenges and skepticism faced in pursuing this less popular but potentially rewarding approach in an AI landscape dominated by large models [6][12][36].

Group 1: Company Overview
- Pokee AI completed a $12 million seed round of financing in July 2025, gaining traction across various industries and technologies [6][14].
- Zhu Zheqing, a former leader of Meta's AI reinforcement learning team, founded Pokee AI with the vision of creating agents that learn actively through exploration and feedback [8][12].

Group 2: Reinforcement Learning Focus
- The article highlights the return of reinforcement learning as a significant technical route, contrasting it with the prevailing focus on large pre-trained models [5][9].
- Zhu Zheqing's approach emphasizes complex environments that allow agents to fail and learn without real-world consequences, addressing the limitations of traditional methods [10][18].

Group 3: Industry Challenges and Perspectives
- Skepticism toward reinforcement learning ran high while scaling laws dominated the AI discourse, leading many investors to question the viability of RL-based approaches [12][25].
- The emergence of InstructGPT in 2022 provided a new paradigm for reinforcement learning, creating a more realistic environment for training agents through human feedback [11][22].

Group 4: Technological Innovations
- Zhu Zheqing advocates an integrated-model approach, challenging the prevalent retrieval-augmented generation (RAG) paradigm, which he believes leads to information loss and inefficiency [26][30].
- The article discusses the limitations of existing tools and APIs in the AI ecosystem, emphasizing the need for AI-native tools that better align with the requirements of intelligent agents [29][30].

Group 5: Future Vision
- Zhu Zheqing envisions a future where agents autonomously explore optimal tool combinations without relying on user input, a significant shift in how AI interacts with technology [29][30].
- The article concludes with Zhu's commitment to reinforcement learning as a pathway to artificial general intelligence (AGI), reflecting a deep-seated belief in the potential of this approach [36].
Laggy, laggy, laggy... Macaron really is laggy, but its attitude really is good
36氪· 2025-11-27 10:14
Core Insights
- The article discusses a new Personal Agent called Macaron, positioned as a unique AI tool designed to cater to individual needs rather than merely enhancing productivity [2][3]
- Macaron aims to create a more personalized interaction by understanding user preferences and behaviors, contrasting with traditional productivity agents [3][4]

Group 1: Product Features and User Interaction
- Macaron is described as a "super understanding AI," capable of generating personalized tools based on user input [4][9]
- The interaction with Macaron is characterized by a conversational style, where it actively engages users and attempts to find common interests [5][6]
- Users have reported that Macaron can be overly talkative and sometimes intrusive in its attempts to identify user needs [5][10]

Group 2: Development and Functionality
- The founder, Chen Kaijie, emphasizes the goal of delivering a "half-usable" product quickly, indicating ongoing optimization efforts [5][15]
- Macaron's ability to create mini-apps is highlighted, but the process can be slow, with users experiencing delays in tool delivery [15][16]
- Its functionality includes features like food diary creation, which combines user input with food recognition and nutritional tracking [16][20]

Group 3: Memory and Learning Mechanism
- Macaron utilizes a reinforcement-learning-based deep memory system, allowing it to retain and recall user interactions over time [28][29]
- This system enables Macaron to provide a more personalized experience by remembering past conversations and user preferences [28][30]

Group 4: User Experience and Feedback
- Users have expressed mixed feelings about the AI's performance, noting both its engaging personality and the limitations of its tool functionalities [20][30]
- The AI's attempts to improve its services based on user feedback demonstrate a commitment to enhancing user experience, although some features may still require refinement [20][30]
RAD: End-to-End Autonomous Driving via 3DGS Combined with Reinforcement Learning
自动驾驶之心· 2025-10-31 00:06
Core Insights
- The paper addresses challenges in deploying end-to-end autonomous driving (AD) algorithms in real-world scenarios, focusing on causal confusion and the open-loop gap [1][2]
- It proposes a closed-loop reinforcement learning (RL) training paradigm based on 3D Gaussian Splatting (3DGS) technology to enhance the robustness of AD policies [2][8]

Summary by Sections

Problem Statement
- The paper identifies two main issues: causal confusion, where imitation learning (IL) captures correlations rather than causal relationships, and the open-loop gap, where IL policies trained in an open-loop manner perform poorly in real-world closed-loop scenarios [1][2][6]

Related Research
- The paper surveys related fields, including dynamic scene reconstruction, end-to-end autonomous driving, and reinforcement learning, highlighting existing methods and their limitations [3][4][5][7]

Proposed Solution
- The proposed RAD framework integrates 3DGS technology with RL and IL, employing a three-stage training paradigm: perception pre-training, planning pre-training, and reinforced post-training [8][24]
- It includes a specially designed safety-related reward function to guide the AD policy in handling safety-critical events [11][24]

Experimental Validation
- The paper details extensive experiments, including collecting 2000 hours of human expert driving demonstrations and creating 4305 high-collision-risk traffic clips for training and evaluation [15][24]
- Nine key performance indicators (KPIs) are used to assess the AD policy, including dynamic collision ratio (DCR) and static collision ratio (SCR) [12][15][24]

Key Findings
- The RAD framework outperforms existing IL methods, achieving a threefold reduction in collision rate (CR) and demonstrating superior performance in complex dynamic environments [9][12][24]
- An RL:IL ratio of 4:1 was found to best balance safety and trajectory consistency [12][15]

Future Directions
- The paper suggests further exploration in areas such as enhancing the interactivity of the 3DGS environment, improving rendering techniques, and expanding the application of RL [17][21][22][29]
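The safety-related reward and the RL-IL mix described above can be sketched as follows. The event penalties, deviation weights, and function names are illustrative assumptions rather than the paper's published values; only the 4:1 RL:IL ratio comes from the summary.

```python
def safety_reward(dynamic_collision: bool, static_collision: bool,
                  position_deviation: float, heading_deviation: float) -> float:
    """Penalize safety-critical events and drift from the expert trajectory.
    Event types mirror the collision/deviation KPIs; weights are illustrative."""
    r = 0.0
    if dynamic_collision:
        r -= 5.0  # hitting a moving agent is the costliest event
    if static_collision:
        r -= 5.0  # hitting a static obstacle
    r -= 0.1 * position_deviation  # meters off the expert path
    r -= 0.1 * heading_deviation   # radians off the expert heading
    return r

def mixed_step_loss(rl_loss: float, il_loss: float,
                    rl_il_ratio=(4, 1)) -> float:
    """Blend the RL objective with imitation learning at the 4:1 ratio
    the paper reports as the best safety/consistency trade-off."""
    w_rl, w_il = rl_il_ratio
    return (w_rl * rl_loss + w_il * il_loss) / (w_rl + w_il)
```

In a training loop, `mixed_step_loss` would combine the policy-gradient loss from closed-loop 3DGS rollouts with the behavior-cloning loss on expert demonstrations at each reinforced post-training step.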
Laggy, laggy, laggy... Macaron really is laggy, but its attitude really is good
36氪· 2025-08-23 09:06
Core Viewpoint
- The article discusses the emergence of a new Personal Agent called Macaron, which aims to provide personalized assistance and enhance user productivity by understanding individual needs and preferences [4][5][6].

Group 1: Product Overview
- Macaron is positioned as an "AI that understands you," designed to create personalized tools based on user input [7][13].
- The app engages users in conversation, often asking questions to identify their needs and interests, resembling a lively and interactive personality [9][10].
- The development team aims for Macaron to deliver functional tools quickly, although the current output may be basic and require further optimization [21][22].

Group 2: User Interaction
- Users have reported that Macaron actively seeks to create mini-apps based on conversational cues, demonstrating a proactive approach to fulfilling user needs [15][19].
- The app's interaction style is characterized by continuous engagement, where it maintains conversation while processing requests, akin to a product manager [22][30].
- Macaron's memory capabilities allow it to retain context from previous interactions, enhancing the user experience with relevant reminders and suggestions [31][34].

Group 3: Technical Aspects
- Macaron utilizes a combination of reinforcement learning and deep memory to improve its memory retention and contextual understanding over time [36][37].
- The AI's ability to remember user preferences and past conversations contributes to a more personalized and engaging interaction, moving beyond traditional AI functionalities [38][39].
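Macaron's internals are not public, but the feedback-shaped memory described here can be illustrated with a toy store in which a reward signal adjusts each memory's salience, and recall ranks entries by keyword overlap, salience, and recency. All class names, parameters, and scoring choices below are hypothetical.

```python
import math
import time

class MemoryStore:
    """Toy long-term memory: each entry carries a salience score that an
    RL-style feedback signal nudges up or down after the user reacts."""

    def __init__(self, decay_halflife_s: float = 86_400.0):
        self.entries = []  # dicts: text, keywords, salience, timestamp
        self.halflife = decay_halflife_s

    def remember(self, text: str, keywords: set) -> None:
        self.entries.append({"text": text, "keywords": keywords,
                             "salience": 1.0, "t": time.time()})

    def recall(self, query_keywords: set, k: int = 3) -> list:
        """Rank by keyword overlap * salience * exponential recency decay."""
        now = time.time()
        def score(e):
            overlap = len(e["keywords"] & query_keywords)
            decay = math.exp(-(now - e["t"]) * math.log(2) / self.halflife)
            return overlap * e["salience"] * decay
        ranked = sorted(self.entries, key=score, reverse=True)
        return [e["text"] for e in ranked[:k] if score(e) > 0]

    def feedback(self, text: str, reward: float, lr: float = 0.5) -> None:
        """Reinforce (positive reward) or suppress (negative) a memory."""
        for e in self.entries:
            if e["text"] == text:
                e["salience"] = max(0.0, e["salience"] + lr * reward)
```

For example, a memory the user reacts badly to (`feedback(..., reward=-2.0)`) drops to zero salience and stops surfacing, while frequently confirmed preferences stay prominent.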
OTC-PO Released | Lifting the Veil on o3: Getting Agents to Use Fewer Tools and Think More!
机器之心· 2025-05-07 04:34
Core Insights
- The article introduces a novel reinforcement learning framework, Optimal Tool Calling Policy Optimization (OTC-PO), which encourages language models to generate correct answers through optimal tool usage, focusing on both the effectiveness and the efficiency of tool interactions [22].

Group 1: Agent Behavior Patterns
- Agents exhibit two primary behavior patterns: Reasoning, which covers internal cognitive processes, and Acting, which involves interaction with external tools and APIs [4][5].
- When models focus only on the correctness of final answers, Reasoning and Acting become confused, leading to cognitive offloading and inefficient tool usage [5][16].

Group 2: Reward Function Design
- Different reward functions are proposed to optimize the balance between Reasoning and Acting, aiming to minimize unnecessary tool calls while maximizing the model's reasoning capabilities [6][12].
- The article emphasizes defining the minimal number of tool calls required for a model to answer a question, which varies with the model's capabilities and the problem's complexity [11].

Group 3: Performance Metrics
- The proposed method achieves a 73.1% reduction in tool calls and a 229.4% increase in tool efficiency without sacrificing accuracy, with improvements in training time and model performance growing with model size [10][16].
- OTC-PO shows superior performance in both in-domain and out-of-domain evaluations compared to existing models, indicating robustness and adaptability across scenarios [20].

Group 4: Cognitive Offloading
- Cognitive offloading is identified as a phenomenon where larger models rely excessively on external tools, hindering their reasoning development; minimizing tool calls can enhance the model's cognitive abilities [16][21].
- A case study illustrates that minimizing tool usage leads to smarter tool application and improved reasoning, aligning with the desired behavior of models like OpenAI's o3 [21].
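A minimal sketch of a tool-call-aware reward in this spirit: the correctness reward is scaled by a tool-productivity factor that is highest at the minimal number of calls and decays with each excess call. The decay shape and names are illustrative assumptions, not the paper's exact formulation.

```python
def otc_style_reward(answer_correct: bool, tool_calls: int,
                     minimal_calls: int) -> float:
    """Scale correctness by a tool-productivity factor so the policy is
    pushed toward answering with the fewest tool calls that suffice."""
    base = 1.0 if answer_correct else 0.0
    if tool_calls <= minimal_calls:
        productivity = 1.0  # at or below the minimum: full reward
    else:
        excess = tool_calls - minimal_calls
        productivity = 1.0 / (1.0 + excess)  # each excess call shrinks reward
    return base * productivity
```

Under this shaping, a correct answer with the minimal number of calls earns the full reward, a correct answer with one redundant call earns half, and a wrong answer earns nothing regardless of how many tools were invoked, which is what discourages cognitive offloading.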
Surpassing YOLOv3 with a Multimodal LLM! Reinforcement Learning Pushes the Limits of Multimodal Perception | Open Source
量子位· 2025-05-03 04:05
Core Insights
- The article introduces Perception-R1 (PR1), a groundbreaking multimodal large language model (MLLM) approach that surpasses previous models like YOLOv3 and Faster-RCNN, achieving over 30 AP on the COCO2017 validation set [1][16].

Group 1: Introduction of Perception-R1
- Perception-R1 was developed by research teams from Huazhong University of Science and Technology, Beijing University of Posts and Telecommunications, and others, focusing on enhancing visual reasoning through rule-based reinforcement learning (RL) [1][5].
- The model aims to improve capabilities in pure visual tasks such as counting and general object detection, as well as visual-language tasks like grounding and OCR [1][4].

Group 2: Importance of Visual Perception in AI
- The article emphasizes the need for a revolution in AI visual perception, highlighting the rapid advancements in AI's ability to understand visual information, which is crucial for applications ranging from autonomous driving to medical diagnostics [3][4].
- It points out the subtle difference between recognizing objects and understanding their interactions in detail, indicating that current MLLMs often struggle with complex visual reasoning tasks [4].

Group 3: Role of Reinforcement Learning
- The rise of reinforcement learning, particularly techniques like RLHF (Reinforcement Learning from Human Feedback) and rule-based RL, has been transformative for language models, prompting the development of Perception-R1 [5][6].
- The article raises the question of whether RL can similarly enhance an MLLM's visual perception, noting that early attempts have shown promise but not universal success [6].

Group 4: Perception-R1 Framework
- Perception-R1 is not a new MLLM built from scratch but a post-training framework designed to significantly enhance the visual perception abilities of existing capable MLLMs [7].
- It employs Group Relative Policy Optimization (GRPO) to optimize the perception policy, which is crucial for improving visual task performance [9].

Group 5: Reward Engineering
- The article discusses the importance of reward modeling in reinforcement learning, where the reward function guides the learning process by quantifying the model's performance on visual tasks [11].
- Perception-R1's reward structure covers extracting relevant visual details, executing logical operations based on visual understanding, and generating outputs in the correct format [11][17].

Group 6: Experimental Results
- Perception-R1 is evaluated against strong benchmarks and specialized models, demonstrating significant improvements on visual counting and object detection tasks [16][19].
- For instance, in visual counting, Perception-R1 achieved 78.1 on Pixmo-Count, outperforming other models [19].

Group 7: Scalability and Future Implications
- The article concludes that Perception-R1 lays a critical foundation for future advancements in intelligent AI visual perception, suggesting that its principles could play a key role in developing next-generation perceptual AI systems [24][25].
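A rule-based detection reward of the kind described above can be sketched as an IoU match score plus a format term. The greedy matching, the 0.5 threshold, and the 0.9/0.1 weights here are illustrative assumptions, not Perception-R1's published reward.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_boxes, gt_boxes, iou_thresh=0.5,
                     format_ok=True) -> float:
    """Rule-based reward: fraction of ground-truth boxes matched by a
    prediction above the IoU threshold, plus a small format bonus.
    Greedy one-to-one matching keeps the rule cheap to evaluate."""
    if not format_ok:
        return 0.0  # unparseable output earns nothing
    unmatched = list(pred_boxes)
    hits = 0
    for gt in gt_boxes:
        best = max(unmatched, key=lambda p: iou(p, gt), default=None)
        if best is not None and iou(best, gt) >= iou_thresh:
            hits += 1
            unmatched.remove(best)  # each prediction matches at most once
    match_score = hits / len(gt_boxes) if gt_boxes else 1.0
    return 0.9 * match_score + 0.1  # format bonus
```

Because the reward is computed purely from rules (box geometry and output parseability), no learned reward model is needed, which is what makes this style of RL cheap to scale across perception tasks.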