Reinforcement Learning
Making Robots Do More Than Just Walk: Nav-R1 Ushers in a New Era of Navigation with Reasoning
机器之心· 2025-09-18 01:01
Core Insights
- The article examines the challenges of enabling robots to understand and execute complex navigation commands in real-world environments, emphasizing the need for stronger reasoning, path planning, and action execution [2][4].

Group 1: Key Innovations
- The paper introduces Nav-R1, a new foundation model that integrates perception, reasoning, and action in 3D environments, enabling the robot to think clearly before acting [5].
- A large dataset, Nav-CoT-110K, consisting of roughly 110,000 Chain-of-Thought trajectories, is constructed for cold-start training, letting the model learn reasoning and action decision-making before reinforcement learning optimization [8].
- During reinforcement learning, Nav-R1 employs three complementary reward mechanisms: a Format Reward, an Understanding Reward, and a Navigation Reward, which together improve the logical structure of its behavior and its alignment with human expectations (a toy composite-reward sketch follows this entry) [9][13].

Group 2: Experimental Results
- Nav-R1 delivers significant gains in success rate and path efficiency across navigation tasks, improving by roughly 8% over other state-of-the-art methods [14].
- In real-world experiments on a mobile robot platform, Nav-R1 navigated complex indoor environments such as meeting rooms and corridors robustly [18][23].

Group 3: Practical Applications
- Nav-R1's capabilities suggest applications in service robots and home assistants, where understanding and navigating cluttered environments is crucial to user experience [29].
- In healthcare settings, Nav-R1 can improve robot navigation in hospitals and nursing homes, supporting safe and reliable operation in complex environments [30].
- Its reasoning and control capabilities also apply to augmented reality (AR) and virtual reality (VR) scenarios in which virtual agents must navigate physical spaces [31].
- In industrial and hazardous environments, Nav-R1's robustness and generalization make it suitable for tasks in factories, mines, and disaster sites [32].
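To make the three-part reward concrete, here is a minimal Python sketch of how format, understanding, and navigation signals might be combined into one scalar. The tag format, the token-overlap proxy, the success threshold, and the weights are all illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of three complementary navigation rewards.
# Scoring rules and weights are illustrative assumptions, not paper values.
import re

def format_reward(response: str) -> float:
    """Reward well-structured output: reasoning in <think> tags followed
    by an <action> block (an assumed output format)."""
    pattern = r"<think>.*?</think>\s*<action>.*?</action>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

def understanding_reward(answer: str, reference: str) -> float:
    """Crude semantic-agreement proxy: token overlap with a reference answer."""
    a, b = set(answer.lower().split()), set(reference.lower().split())
    return len(a & b) / max(len(b), 1)

def navigation_reward(path, goal, success_dist: float = 0.5) -> float:
    """Reward reaching the goal, with a shaping term for path efficiency."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    success = dist(path[-1], goal) < success_dist
    straight_line = dist(path[0], goal)
    traveled = sum(dist(path[i], path[i + 1]) for i in range(len(path) - 1))
    efficiency = straight_line / max(traveled, 1e-6)  # 1.0 = perfectly direct
    return (1.0 + efficiency) if success else 0.0

def total_reward(response, answer, reference, path, goal,
                 w=(0.2, 0.3, 0.5)):  # illustrative weights
    return (w[0] * format_reward(response)
            + w[1] * understanding_reward(answer, reference)
            + w[2] * navigation_reward(path, goal))
```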
A Chinese Large Model Makes the Cover of Nature for the First Time! DeepSeek Discloses: Training R1 Cost Just ¥2 Million
量子位· 2025-09-18 00:51
Core Insights
- DeepSeek has become the first Chinese large-model company to be featured on the cover of Nature, with founder Liang Wenfeng as the corresponding author [2][3].
- The R1 model has been recognized for its innovative approach, achieving significant performance gains on reasoning tasks through a pure reinforcement learning framework [19][20].

Group 1: Achievements and Recognition
- DeepSeek's R1 is the first large language model to undergo peer review, a significant milestone for the field [5].
- The model has accumulated 3,596 citations on Google Scholar and 10.9 million downloads from Hugging Face, indicating widespread adoption [7].
- R1's training cost of roughly $294,000 is far below that of competitors, which often exceeds $10 million, challenging the notion that top-tier AI models require enormous investment [12][13].

Group 2: Training and Data
- R1 was trained on 512 H800 GPUs for 198 hours, at a total training cost of $294,000 [10][11].
- The R1 dataset spans five data types (Math, Code, STEM, Logic, and General), totaling 126,000 prompts [15][18].
- Training combined cold-start data, reinforcement learning, and supervised fine-tuning to strengthen reasoning [25][26].

Group 3: Performance Metrics
- DeepSeek-R1-Zero reached a pass@1 score of 71.0% on AIME 2024, up sharply from 15.6% (see the pass@k sketch after this entry) [21].
- Against other leading models, DeepSeek-R1 was competitive across benchmarks including MATH-500 and LiveCode [23][30].
- Models distilled from DeepSeek-R1 outperformed applying reinforcement learning directly to the base model, showcasing the effectiveness of the training approach [29].

Group 4: Safety and Transparency
- DeepSeek released a detailed safety assessment of R1, indicating a moderate inherent safety level comparable to GPT-4o [18][22].
- The company open-sourced the model weights for DeepSeek-R1 and DeepSeek-R1-Zero on Hugging Face, promoting community engagement [30].
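The pass@1 figures above are normally estimated by drawing several samples per problem rather than one greedy decode. Below is the standard unbiased pass@k estimator from the code-generation evaluation literature; applying it to AIME-style scoring here is an assumption for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem, c of them
    correct. Returns the probability that at least one of k draws passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example (hypothetical numbers): 16 sampled solutions per problem with 11
# correct gives pass@1 = 11/16 = 0.6875, close to the 71.0% figure cited.
print(pass_at_k(n=16, c=11, k=1))  # 0.6875
```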
DeepSeek-R1 Paper Makes the Cover of Nature, with Liang Wenfeng as Corresponding Author
36Kr· 2025-09-18 00:45
What a surprise, and yet thoroughly deserved! The cover of the latest issue of Nature features the DeepSeek-R1 research: the paper DeepSeek posted to arXiv this January, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". The corresponding author of the Nature paper is Liang Wenfeng.

Paper link: https://www.nature.com/articles/s41586-025-09422-z

In its cover blurb, Nature writes: large models tend to solve problems better when they can plan the steps needed to solve them. This "reasoning" resembles how humans work through harder problems, but it is a major challenge for AI, normally requiring human intervention to add labels and annotations. In this week's issue, DeepSeek's researchers reveal how they were able to train a model to reason with minimal human input.

The DeepSeek-R1 model is trained with reinforcement learning, in which it earns a high reward for solving a math problem correctly and is penalized for answering incorrectly. As a result, it learned to reason, working through problems step by step and revealing those steps, and became more likely to arrive at the correct ...
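The reward scheme Nature's blurb describes, a score for a correct final answer and a penalty otherwise, is a verifiable rule-based signal rather than a learned reward model. Here is a minimal sketch of such a verifier, assuming answers arrive in a `\boxed{...}` format (an assumption for illustration, not a detail from the paper):

```python
import re

def math_reward(model_output: str, ground_truth: str) -> float:
    """Rule-based reward: +1 if the final boxed answer matches the ground
    truth, -1 otherwise. The \\boxed{...} convention is assumed."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match and match.group(1).strip() == ground_truth.strip():
        return 1.0
    return -1.0

print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```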
Just Now: Liang Wenfeng Publishes in Nature
36Kr· 2025-09-17 23:43
Last night, DeepSeek made history once again!

智东西 (Zhidx) reported on September 18 that on September 17, the DeepSeek-R1 reasoning-model research paper, completed jointly by the DeepSeek team with Liang Wenfeng as corresponding author, appeared on the cover of the authoritative international journal Nature.

The DeepSeek-R1 paper was the first to publicly demonstrate that reinforcement learning alone can elicit reasoning ability in large models, a finding that has inspired AI researchers worldwide. The model has also become the world's most popular open-source reasoning model, with more than 10.9 million downloads on Hugging Face. Nature's endorsement is well deserved.

DeepSeek-R1 is also the world's first mainstream large language model to undergo peer review. Nature's editorial praised it highly: almost no mainstream large models have undergone independent peer review, and this gap "has finally been broken by DeepSeek."

Nature observed that unverified claims and hype have become "commonplace" in the AI industry, and that everything DeepSeek has done is "a welcome step toward transparency and reproducibility."

Nature cover headline: Self-help: reinforcement learning teaches large models to improve themselves.

The new version of the DeepSeek-R1 paper published in Nature differs considerably from the un-peer-reviewed version from January, disclosing more details of model training and directly addressing the distillation allegations raised when the model first launched.

| https:// ...
Science Robotics Cover: DeepMind Releases RoboBallet, Redefining Multi-Robot Collaborative Planning
机器人大讲堂· 2025-09-17 11:13
Core Viewpoint
- Multi-robot systems are becoming a key technology for improving production efficiency in modern industrial manufacturing, but coordinating multiple robots in shared environments remains a significant challenge [1][4].

Group 1: Challenges in Multi-Robot Coordination
- Effective multi-robot coordination requires solving three core sub-problems: motion planning, task scheduling, and task assignment, each computationally demanding in its own right [3][4].
- Motion planning requires collision-free paths for each robot and grows exponentially harder as robots and obstacles are added [3].
- Task scheduling resembles the classic Traveling Salesman Problem, with complexity that escalates with the number of tasks [3].
- Task assignment decides which robot performs which task, and each assignment's cost depends on the other assignments, coupling all three sub-problems together [3][4].

Group 2: RoboBallet Framework
- RoboBallet, a framework developed by engineers at University College London and Google DeepMind, combines Graph Neural Networks (GNNs) and Reinforcement Learning (RL) to solve multi-robot coordination automatically [4][5].
- The framework represents the collaborative scene as a dynamic graph in which nodes represent individual robots and edges denote their interactions based on spatial proximity (a toy graph-and-policy sketch follows this entry) [5].
- GNNs process this structured information efficiently, allowing the model to generalize to unseen configurations of obstacles and tasks [5].

Group 3: Training and Performance
- RoboBallet trains its policy network with a fine-tuned TD3 algorithm, generating multi-robot trajectories that jointly address task assignment, scheduling, and motion planning [7].
- The reward combines task-completion rewards with collision penalties, promoting efficient execution while avoiding collisions [7].
- The model is trained in randomly generated environments, learning effective coordination strategies over millions of interactions [7][9].

Group 4: Computational Efficiency and Scalability
- RoboBallet completes each planning step in roughly 0.3 milliseconds even at its maximum configuration of 8 robots, 40 tasks, and 30 obstacles [8].
- Inference time scales linearly with the numbers of robots, tasks, and obstacles, making the framework feasible for real-time use [11].
- Adding robots substantially improves task-execution efficiency: increasing from 4 to 8 robots cuts average execution time from 7.5 to 4.5 seconds, a 40% reduction [12].
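To make the graph formulation concrete, here is a minimal PyTorch sketch of encoding a scene as a proximity graph and scoring it with one round of message passing feeding a TD3-style actor head. The feature layout, proximity threshold, and network sizes are illustrative assumptions, not RoboBallet's actual architecture.

```python
# Illustrative sketch of a GNN policy over a proximity graph of robots.
# Feature layouts, the radius, and the network are assumptions.
import torch
import torch.nn as nn

def build_edges(positions: torch.Tensor, radius: float = 1.5) -> torch.Tensor:
    """Connect entities whose positions lie within `radius` of each other."""
    d = torch.cdist(positions, positions)          # pairwise distances
    src, dst = torch.nonzero(d < radius, as_tuple=True)
    mask = src != dst                              # drop self-loops
    return torch.stack([src[mask], dst[mask]])     # shape: (2, num_edges)

class GNNPolicy(nn.Module):
    """One round of message passing, then a per-robot action head (actor)."""
    def __init__(self, feat_dim: int = 8, hidden: int = 64, act_dim: int = 7):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(feat_dim + hidden, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, act_dim)    # e.g. joint velocities

    def forward(self, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        src, dst = edges
        messages = self.msg(torch.cat([x[src], x[dst]], dim=-1))
        agg = torch.zeros(x.size(0), messages.size(-1)).index_add_(0, dst, messages)
        h = self.update(torch.cat([x, agg], dim=-1))
        return torch.tanh(self.actor(h))           # bounded continuous actions

# Toy scene: 8 robots as nodes with 8-d features (first 3 dims = position).
feats = torch.randn(8, 8)
edges = build_edges(feats[:, :3])
actions = GNNPolicy()(feats, edges)                # (8, 7) action tensor
```

In a TD3 training loop, the reward fed back for each trajectory would combine task-completion bonuses and collision penalties, matching the reward structure described above.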
X @s4mmy
s4mmy· 2025-09-15 15:49
Disclaimer: I hold investments & have existing partnerships with some of the Agents/protocols mentioned above. https://t.co/MdbljS5vvu

s4mmy (@S4mmyEth): Meta just revealed a 25x faster method of training AI using Reinforcement Learning (RL). AI and Robotics will continue to gain traction as models evolve. Here's the roundup for the DeAI/DePAI segment this week 🧵(1/9) https://t.co/nICwXpLNNB ...
Alphabet's Isomorphic Labs: Turning Cancer Into a Chronic, But Livable Disease
YouTube· 2025-09-14 06:00
Core Insights
- The company is developing a drug design engine that uses advanced AI models to generate new molecule designs across diseases and modalities, substantially improving the drug discovery process [2][3][10].
- The approach pairs generative AI with predictive capabilities for understanding protein structures and interactions, aiming to improve the efficacy and safety of drug candidates [5][6][12].
- The focus is on generalizability, allowing the models to be applied across targets and disease areas, a more ambitious and challenging goal than traditional drug design methods [27][30][54].

Group 1
- The drug design engine incorporates multiple AI models, including models for predicting protein structures and binding affinities, to streamline drug development [3][4][6].
- Traditional drug design is iterative and time-consuming, often taking weeks or months per molecule; the new approach enables virtual testing and rapid iteration (an abstract loop is sketched after this entry) [8][10].
- The company aims to shorten the drug discovery timeline significantly, potentially reaching experimental-level accuracy in its predictions and minimizing reliance on physical lab work [47][49].

Group 2
- The focus on immunology and oncology is strategic: these areas carry significant clinical impact and allow for more tractable clinical trials [33][34].
- The company is making progress in identifying novel chemical matter for previously challenging targets, demonstrating the effectiveness of its AI-driven approach [44][45].
- The ambition is a generalizable technology reusable across drug design campaigns, which is rare in the biotech industry [54][55].

Group 3
- The company is partnering with major pharmaceutical firms such as Novartis and Eli Lilly to leverage their expertise and accelerate drug discovery [43][44].
- The models can analyze entire families of proteins, enabling a comprehensive understanding of molecular interactions that traditional methods cannot achieve [39][40].
- The long-term vision includes AI tools that assist in diagnosing and treating disease, potentially transforming how patients interact with healthcare [50][51].
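The virtual generate-and-test cycle described above can be sketched abstractly. The helpers and numbers below are hypothetical stand-ins, not Isomorphic's actual engine; the sketch only shows why in-silico iteration is faster than a weeks-per-molecule lab loop.

```python
# Toy illustration of a generate-then-predict design loop.
# `propose_variants` and `predict_affinity` are hypothetical stand-ins.
import random

def propose_variants(seed: str, n: int = 50) -> list[str]:
    """Stand-in generative model: derive n candidate molecules from a seed."""
    return [f"{seed}/v{random.randint(0, 9999)}" for _ in range(n)]

def predict_affinity(molecule: str) -> float:
    """Stand-in predictive model: predicted binding affinity, higher is better."""
    return random.random()

best, best_score = "seed-molecule", 0.0
for _ in range(10):                        # ten virtual iterations, no lab work
    for cand in propose_variants(best):
        score = predict_affinity(cand)
        if score > best_score:             # keep the strongest predicted binder
            best, best_score = cand, score
print("final candidate:", best, round(best_score, 3))
```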
Meta Superintelligence Labs' New Paper Mired in Controversy! Accused of Ignoring Extensive Prior Work
量子位· 2025-09-12 00:59
Meta Superintelligence Labs (MSL) has once again been thrust into controversy. This time it is not a personnel shakeup: their second paper, "Language Self-Play For Data-Free Training", has been criticized for overlooking prior research and lacking novelty.

So what is the paper about?

Letting the model learn through self-play. In short, the core idea of MSL's new paper is a Language Self-Play (LSP) method that lets a large language model improve itself without any additional training data. The method targets a real constraint: today's large language models depend heavily on large volumes of high-quality training data, and that data is finite.

To this end, LSP frames the model's learning as a game in which one and the same language model plays two adversarial roles, enabling data-free training.

Specifically, the two roles are a Challenger and a Solver. During play, the Challenger generates ever trickier questions or instructions to drive down the Solver's expected return, while the Solver must work to understand and answer those instructions to maximize its own return. This is the familiar minimax game. Through such adversarial training, the model keeps improving as the game goes on, steadily raising its capability. Moreover, unlike traditional adversarial training, LSP lets ...
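In outline, the minimax loop is easy to sketch. The following simplified Python illustration uses hypothetical `generate` and `score` stand-ins for the real LLM and reward computation; it shows only the zero-sum bookkeeping between the two roles, not the paper's actual training procedure.

```python
# Simplified illustration of a Language Self-Play (LSP) style minimax loop.
# `generate` and `score` are hypothetical stand-ins for a real LLM and reward.
import random

def generate(model: dict, role: str) -> str:
    """Stand-in for LLM sampling: one model, steered by a role prompt."""
    return f"[{model['name']} as {role}] output#{random.randint(0, 999)}"

def score(instruction: str, answer: str) -> float:
    """Stand-in for judging how well the answer solves the instruction."""
    return random.random()

model = {"name": "lm"}
for step in range(3):
    instruction = generate(model, "challenger")   # same model, Challenger role
    answer = generate(model, "solver")            # same model, Solver role
    r = score(instruction, answer)
    solver_objective = r            # the Solver maximizes the return
    challenger_objective = -r       # the Challenger minimizes it (zero-sum)
    # A real implementation would apply policy-gradient updates here.
    print(step, round(solver_objective, 3), round(challenger_objective, 3))
```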