Reinforcement Learning
X @Avi Chawla
Avi Chawla· 2025-10-23 20:02
Core Concept of Memento
- Memento reframes continual learning as memory-based online reinforcement learning over a memory-augmented MDP, learning from experiences using memory instead of updating LLM weights [2]
- Memento aims to improve AI agent performance from experience without fine-tuning LLM weights [1]

Key Components
- Case-Based Reasoning (CBR) decomposes complex tasks into sub-tasks and retrieves relevant past experiences [2]
- The Executor executes each subtask using MCP tools and records outcomes in memory for future reference [3]

MCP Tools
- MCP tools enable the executor to accomplish most real-world tasks [3]
- MCP tools include Web research, Document handling, Safe Python execution, Data analysis, and Media processing [3]
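The write-then-retrieve loop at the heart of this design can be pictured with a short sketch: store (task, plan, reward) cases as embeddings and fetch the most similar past cases for a new task, with no weight updates. This is a minimal illustration under assumed names; the embedding function, case schema, and retrieval policy are stand-ins, not Memento's actual implementation.

```python
import numpy as np

class CaseMemory:
    """Minimal case bank: write past experiences, read back the nearest
    cases by cosine similarity. The encoder is a placeholder."""

    def __init__(self, embed):
        self.embed = embed   # callable: str -> np.ndarray
        self.cases = []      # list of (embedding, record) pairs

    def write(self, task: str, plan: str, reward: float) -> None:
        record = {"task": task, "plan": plan, "reward": reward}
        self.cases.append((self.embed(task), record))

    def read(self, task: str, k: int = 3):
        q = self.embed(task)
        def sim(vec):
            return float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec) + 1e-8))
        ranked = sorted(self.cases, key=lambda c: sim(c[0]), reverse=True)
        return [record for _, record in ranked[:k]]

# Toy usage with a bag-of-words embedder standing in for a real encoder
def toy_embed(text: str) -> np.ndarray:
    vec = np.zeros(64)
    for tok in text.lower().split():
        vec[hash(tok) % 64] += 1.0
    return vec

memory = CaseMemory(toy_embed)
memory.write("summarize the quarterly report", "split by section, summarize each, merge", reward=1.0)
print(memory.read("summarize the annual report", k=1))
```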
Harder to Take a Computer Apart Than to Build One? This "Surgical-Grade" Robotic Hand Is Cracking the E-Waste Problem
机器人大讲堂· 2025-10-23 14:37
Core Viewpoint
- The article discusses the challenges and innovations in recycling electronic waste, focusing on DeGrip, a specialized robotic claw designed to dismantle electronic devices efficiently and effectively [1][26].

Group 1: Technological Challenges
- Dismantling end-of-life (EOL) electronics is a crucial part of the circular economy, but it is technically challenging due to the complexity and variability of different manufacturers' products [1].
- Traditional industrial robots excel at assembly but are rarely used for dismantling because of their limited flexibility in confined spaces [2][4].

Group 2: Innovation in Robotics
- DeGrip is a newly designed robotic claw that combines small size with high flexibility, allowing it to operate in tight spaces inside electronic devices [4][5].
- The claw features three degrees of freedom (DOF), enabling it to perform complex dismantling tasks with precision [5][11].
- A cable-driven mechanism allows for a compact design that can navigate tight spaces while maintaining efficiency [6][7].

Group 3: Simulation and Testing
- Before physical testing, DeGrip was evaluated in a virtual environment using a digital model of a desktop computer to assess its performance on dismantling tasks [12][20].
- The simulation tasks included removing RAM modules, SSDs, and HDDs from confined spaces, demonstrating DeGrip's adaptability and precision [14][16][18][20].

Group 4: Prototype Development
- A physical prototype of DeGrip was built with 3D printing and tested in real-world scenarios, confirming its structural integrity and responsiveness [22][24].
- The prototype's performance validated the reliability of the cable-driven design and its feasibility for practical applications [24].

Group 5: Future Directions
- The next phase involves using DeGrip to gather operational data for developing intelligent learning systems, enabling robots to learn autonomous dismantling strategies [26].
- This innovation aims to improve the efficiency of electronic-waste recycling, contributing to a more sustainable circular economy [27].
How Are Reinforcement Learning and VLA Combined? An Analysis of Several Representative Works and the Challenges Involved
具身智能之心· 2025-10-22 03:04
Core Insights
- The article discusses the integration of reinforcement learning (RL) with Vision-Language-Action (VLA) models to enhance robotic capabilities, enabling robots to understand visual and linguistic instructions while optimizing their actions through trial and error [2][8].

Group 1: VLA and Reinforcement Learning Integration
- The combination of VLA models and RL allows robots to interpret tasks and adjust their actions based on feedback, improving their performance in complex environments [2][3].
- The GRAPE framework enhances the generalization of robotic policies by aligning preferences, breaking complex tasks into manageable stages, and optimizing actions through RL, resulting in success rate increases of 51.79% on seen tasks and 58.20% on unseen tasks [6][7].

Group 2: Addressing Generalization Challenges
- VLA models struggle to generalize in unfamiliar scenarios; the VLA-RL framework models robotic operation as a multi-turn dialogue and achieves higher success rates on 40 complex tasks than pure imitation learning [8][10].
- The ReWiND framework generates flexible reward functions from language descriptions, allowing robots to adapt to new tasks with learning that is twice as fast in simulation and five times faster in real-world applications [12][14].

Group 3: Fine-Tuning Strategies
- The ConRFT framework combines offline and online fine-tuning, achieving an average success rate of 96.3% across eight real-world tasks and significantly outperforming traditional supervised learning [15][18].
- The Dual-Actor framework uses a pre-trained VLA model to master basic actions before fine-tuning through RL, enhancing the robot's ability to perform complex assembly tasks with higher success rates [20][22].

Group 4: Safety and Efficiency
- Safety mechanisms are integrated into the RL process to prevent collisions and damage during robotic exploration, ensuring a secure and efficient learning environment [23][24].
- The article emphasizes the importance of designing efficient multi-modal encoders to address the challenge of fusing visual, linguistic, and action data without information loss [27][28].
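The "preference alignment" step attributed to GRAPE can be read as a DPO-style objective applied at the trajectory level: push up the summed action log-probability of preferred rollouts relative to rejected ones, regularized against a reference policy. The sketch below is one plausible rendering under that reading; the function name, inputs, and the beta coefficient are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def trajectory_preference_loss(logp_chosen, logp_rejected,
                               ref_logp_chosen, ref_logp_rejected,
                               beta: float = 0.1):
    """Preference loss over whole rollouts: each input is the summed
    action log-probability of a trajectory under the current policy
    (logp_*) or a frozen reference policy (ref_logp_*)."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Example with dummy per-trajectory log-prob sums
logp_w = torch.tensor([-12.3, -9.8])
logp_l = torch.tensor([-15.1, -11.2])
ref_w = torch.tensor([-13.0, -10.0])
ref_l = torch.tensor([-14.5, -11.0])
print(trajectory_preference_loss(logp_w, logp_l, ref_w, ref_l))
```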
Autonomous Driving Paper Roundup: VLA, World Models, Reinforcement Learning, Trajectory Planning, and More...
自动驾驶之心· 2025-10-18 04:00
Core Insights
- The article discusses advancements in autonomous driving technologies, highlighting various research contributions and their implications for the industry.

Group 1: DriveVLA-W0
- The DriveVLA-W0 training paradigm enhances the generalization ability and data scalability of VLA models by using world modeling to predict future images, achieving 93.0 PDMS and 86.1 EPDMS on the NAVSIM benchmarks [6][12]
- A lightweight Mixture-of-Experts (MoE) architecture reduces inference latency to 63.1% of the baseline VLA, meeting real-time deployment needs [6][12]
- The data scaling law amplification effect is validated, showing significant performance improvements as data volume increases, with a 28.8% reduction in ADE and a 15.9% decrease in collision rate when training on 70M frames [6][12]

Group 2: CoIRL-AD
- The CoIRL-AD framework combines imitation learning and reinforcement learning within a latent world model, achieving an 18% reduction in collision rate on the nuScenes dataset and a PDMS score of 88.2 on the Navsim benchmark [13][16]
- The framework integrates RL into an end-to-end autonomous driving model, addressing offline RL's scene expansion issues [13][16]
- A decoupled dual-policy architecture facilitates structured interaction between imitation learning and reinforcement learning, enhancing knowledge transfer [13][16]

Group 3: PAGS
- The Priority-Adaptive Gaussian Splatting (PAGS) framework achieves high-quality real-time 3D reconstruction in dynamic driving scenarios, with a PSNR of 34.63 and SSIM of 0.933 on the Waymo dataset [23][29]
- PAGS incorporates semantic-guided pruning and regularization to balance reconstruction fidelity and computational cost [23][29]
- The framework demonstrates a rendering speed of 353 FPS with a training time of only 1 hour and 22 minutes, outperforming existing methods [23][29]

Group 4: Flow Planner
- The Flow Planner achieves a score of 90.43 on the nuPlan Val14 benchmark, the first learning-based method to surpass 90 without prior knowledge [34][40]
- It introduces fine-grained trajectory tokenization to enhance local feature extraction while maintaining motion continuity [34][40]
- The architecture employs adaptive layer normalization and scale-adaptive attention to filter redundant information and strengthen the extraction of key interaction information [34][40]

Group 5: CymbaDiff
- The CymbaDiff model defines a new task of sketch-based 3D outdoor semantic scene generation, achieving an FID of 40.74 on the Sketch-based SemanticKITTI dataset [44][47]
- It introduces SketchSem3D, a large-scale benchmark dataset for evaluating 3D semantic scene generation [44][47]
- The model employs a Cylinder Mamba diffusion mechanism to enhance spatial coherence and local neighborhood relationships [44][47]

Group 6: DriveCritic
- The DriveCritic framework utilizes vision-language models for context-aware evaluation of autonomous driving, achieving 76.0% accuracy on human-preference alignment tasks [55][58]
- It addresses limitations of existing evaluation metrics by focusing on context sensitivity and human alignment in nuanced driving scenarios [55][58]
- The framework demonstrates superior performance compared to traditional metrics, providing a reliable solution for human-aligned evaluation in autonomous driving [55][58]
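The imitation-plus-RL coupling that CoIRL-AD is described as using can be sketched as a single joint objective: an imitation term that regresses planned trajectories onto expert ones, plus a policy-gradient term weighted by advantages (for example, estimated from rollouts in a latent world model). This is a simplified sketch with assumed tensor shapes and loss weights; CoIRL-AD's actual decoupled dual-policy interaction is more structured than a weighted sum.

```python
import torch
import torch.nn.functional as F

def imitation_plus_rl_loss(pred_actions, expert_actions,
                           logp_actions, advantages,
                           il_weight: float = 1.0, rl_weight: float = 0.1):
    """Joint objective: imitation regression onto expert trajectories plus
    an advantage-weighted policy-gradient term."""
    il_loss = F.mse_loss(pred_actions, expert_actions)
    rl_loss = -(logp_actions * advantages.detach()).mean()
    return il_weight * il_loss + rl_weight * rl_loss

# Dummy shapes: batch of 4 planned trajectories, 6 waypoints, (x, y) each
pred = torch.randn(4, 6, 2, requires_grad=True)
expert = torch.randn(4, 6, 2)
logp = torch.randn(4, requires_grad=True)
adv = torch.randn(4)
print(imitation_plus_rl_loss(pred, expert, logp, adv))
```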
Is "Importance Sampling" Not So "Important" After All? Kuaishou and Tsinghua's ASPO Tackles Importance-Sampling Weight Mismatch
量子位· 2025-10-15 10:20
Core Insights
- Reinforcement Learning (RL) has become a crucial component in the post-training phase of Large Language Models (LLMs) like ChatGPT and DeepSeek [1]
- A significant issue has emerged with the increasing scale of model parameters: the importance sampling (IS) mechanism may not be as beneficial as previously thought [2][5]
- The research team from Kuaishou and Tsinghua University identified a deep-rooted "weight mismatch" phenomenon in existing supervised RL paradigms, leading to overconfidence in models and potential issues like entropy collapse and premature convergence [2][6]

Importance Sampling Issues
- Importance sampling is intended to correct the distribution differences between old and new policies, allowing models to reuse old data without deviating from the target distribution [5]
- In small-scale RL, IS is effective; however, it fails in the context of supervised RL for large language models [6]
- Experiments showed that in GRPO algorithms, IS did not provide the expected benefits and instead contributed to training instability [7]

Weight Mismatch and Self-Reinforcing Loops
- The research revealed that the advantage values in supervised RL are inaccurate, as different tokens contribute differently to the final answer [8]
- The average IS weight of positive-advantage tokens is higher than that of negative-advantage tokens, leading to a decrease in entropy [9]
- In supervised RL algorithms, IS has shifted from being a correction term to a token-level weight, causing a self-reinforcing loop that reinforces high-scoring tokens while neglecting low-probability ones [11][12]

ASPO Algorithm Introduction
- The proposed ASPO (Asymmetric Importance Sampling Policy Optimization) algorithm addresses these issues by inverting the IS weights of positive-advantage tokens, allowing low-probability tokens to receive stronger updates [3][18]
- ASPO incorporates a Dual-Clipping mechanism to manage extreme values resulting from the inverted weights, ensuring stability while maintaining effective gradient flow [20]

Experimental Results
- ASPO demonstrated significant advantages on various benchmarks, including mathematical reasoning and code generation tasks, outperforming traditional methods [24]
- The average performance improvement was 12.5% on mathematical tasks and 17.0% on code generation tasks, with smoother training curves and reduced entropy collapse [26]
- ASPO achieved notable results on the LiveCodeBench v5 benchmark, indicating its advantage over mainstream RL methods [26][27]
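A minimal PyTorch-style sketch of the asymmetric weighting described above: negative-advantage tokens keep the usual PPO-style clipped ratio, while positive-advantage tokens use an inverted IS ratio, bounded by a second (dual) clip, so tokens the current policy assigns low probability to receive larger updates. The function signature, clipping constants, and the exact way the flipped ratio enters the surrogate are illustrative assumptions, not the paper's formulation.

```python
import torch

def asymmetric_is_surrogate(logp_new, logp_old, advantages,
                            clip_eps: float = 0.2, dual_clip: float = 3.0):
    """Per-token surrogate: standard clipped ratio for A_t < 0, inverted and
    dual-clipped ratio for A_t > 0."""
    ratio = torch.exp(logp_new - logp_old)                 # r_t = pi_new / pi_old
    inverted = torch.clamp(1.0 / ratio, max=dual_clip)     # flipped weight for positive advantage
    weight = torch.where(advantages > 0, inverted, ratio)
    unclipped = weight * advantages
    clipped = torch.clamp(weight, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()           # minimize the negative surrogate

# Dummy token-level inputs (flattened over batch and sequence)
logp_new = torch.randn(8, requires_grad=True)
logp_old = torch.randn(8)
adv = torch.randn(8)
print(asymmetric_is_surrogate(logp_new, logp_old, adv))
```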
The Open-Source Coding Model Throne Has Changed Hands: Who Would Have Guessed the New SOTA Comes from Kuaishou
量子位· 2025-10-11 06:04
Core Insights
- The article highlights the emergence of Kuaishou's KAT-Dev-72B-Exp as the leading open-source programming model, achieving a score of 74.6% on the SWE-Bench Verified leaderboard [1][4].

Group 1: Model Performance
- KAT-Dev-72B-Exp is an experimental reinforcement-learning version of the KAT-Coder model, which has also outperformed GPT-5 (non-Codex mode) and Claude 4 Sonnet on SWE-Bench Verified [3][4].
- KAT-Coder demonstrates capabilities such as recreating a complete version of the game "Fruit Ninja" within a web environment, including scoring and life systems [6].

Group 2: Visualization and Interaction
- The model excels at visualizing physical laws through code, with examples including a cyberpunk clock that triggers explosion effects and a solar-system simulation created with three.js [10][13].
- KAT-Coder can generate interactive effects and animations that adhere to real physical principles, such as a 60-story building-collapse simulation [15].

Group 3: Key Technologies
- KAT-Coder employs multiple training phases, including mid-training, supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT), leading to emergent behaviors in the model [17][25].
- The number of interactions the model needs to complete a task decreased by 32% after reinforcement learning, indicating improved efficiency [26].

Group 4: Industrial-Grade Framework
- Kuaishou's self-developed industrial-grade reinforcement learning framework, SeamlessFlow, supports complex scenarios like multi-agent and online reinforcement learning [28][29].
- SeamlessFlow has shown a 100% throughput improvement on single-round RL tasks and a 62% reduction in overall training time compared to mainstream VERL frameworks [35].

Group 5: Training Optimization
- The introduction of a Trie Packing mechanism and a restructured training engine allow KAT-Dev-72B-Exp to train efficiently on shared-prefix trajectories, achieving an average speedup of 2.5x [37].
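The shared-prefix idea behind Trie Packing can be illustrated with a toy prefix trie over token IDs: rollouts that branch from the same prompt store (and could forward-pass) the common prefix only once. This is only a conceptual sketch; SeamlessFlow's actual packing, attention masking, and KV-cache handling are not described in the article, and the names below are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TrieNode:
    children: Dict[int, "TrieNode"] = field(default_factory=dict)
    passes: int = 0   # how many trajectories run through this token position

def build_trie(trajectories: List[List[int]]) -> TrieNode:
    """Insert token-ID sequences into a prefix trie so a prefix shared by
    several sampled rollouts is stored only once."""
    root = TrieNode()
    for traj in trajectories:
        node = root
        for tok in traj:
            node = node.children.setdefault(tok, TrieNode())
            node.passes += 1
    return root

def unique_positions(node: TrieNode) -> int:
    """Count the token positions the packed representation keeps."""
    return sum(1 + unique_positions(child) for child in node.children.values())

# Two rollouts branching from the same prompt prefix [1, 2, 3, 4]
rollouts = [[1, 2, 3, 4, 7, 8], [1, 2, 3, 4, 9]]
trie = build_trie(rollouts)
print(unique_positions(trie), "packed positions vs", sum(map(len, rollouts)), "unpacked")
```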
Not Mysticism: HKUST, Tsinghua, and Collaborators Pry Open the Reasoning Black Box to Show How RL Makes AI Think Like Humans
具身智能之心· 2025-10-10 00:02
Core Insights
- The article discusses recent research by teams from the Hong Kong University of Science and Technology, the University of Waterloo, and Tsinghua University, which reveals that large language models (LLMs) learn reasoning in a human-like manner by separating high-level strategy planning from low-level execution [3][10][12].

Group 1: Reinforcement Learning and LLMs
- Reinforcement Learning (RL) enhances the reasoning capabilities of LLMs, although the underlying mechanisms had not been clearly understood until now [2][5].
- The research highlights the importance of RL in enabling models to exhibit reflective behaviors during interactions with the RL environment [7][10].
- Two significant experimental clues are identified: the "length scaling effect" and the "aha moment," indicating that LLMs can learn to use more thinking time to solve reasoning tasks [8][9][10].

Group 2: Learning Dynamics
- The study outlines a two-phase learning dynamic in LLMs during RL training: the first phase focuses on consolidating basic execution skills, while the second phase shifts toward exploring high-level planning strategies [14][22].
- In the first phase, the model focuses on mastering low-level operations, marked by a decrease in the uncertainty of execution tokens [23][24].
- In the second phase, the model actively expands its library of planning strategies, which correlates with improved reasoning accuracy and longer solution chains [28][30].

Group 3: HICRA Algorithm
- The research introduces a new algorithm called HICRA (Hierarchy-Aware Credit Assignment), which emphasizes the learning of planning tokens over execution tokens to enhance reasoning capabilities [18][42].
- HICRA consistently outperforms mainstream methods like GRPO, particularly when the model has a solid foundation in execution skills [20][45].
- Experimental results show that HICRA leads to significant improvements over GRPO on various reasoning benchmarks, indicating its effectiveness in optimizing planning tokens [46][47].

Group 4: Insights on Token Dynamics
- The study reveals that observed phenomena such as "aha moments" and "length scaling" are not random but are indicative of a structured learning process [33][35].
- Overall token-level entropy decreases as the model becomes more predictable in executing low-level tasks, while the semantic entropy of planning tokens increases, reflecting the model's exploration of new strategies [39][40].
- The findings suggest that the key to enhancing reasoning capabilities lies in improving planning abilities rather than merely optimizing execution details [20][41].
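The credit-assignment idea can be sketched as re-weighting token advantages before the policy-gradient step: tokens tagged as high-level planning tokens receive amplified credit relative to routine execution tokens. The mask, the amplification factor, and the way planning tokens are identified are illustrative assumptions; HICRA's concrete formulation may differ.

```python
import torch

def hierarchy_aware_loss(logp_tokens, advantages, planning_mask,
                         alpha: float = 0.5):
    """Boost the advantage of planning tokens (mask = 1) before the
    policy-gradient update, so planning decisions get more credit."""
    shaped_adv = advantages * (1.0 + alpha * planning_mask.float())
    return -(logp_tokens * shaped_adv.detach()).mean()

# Dummy sequence of 6 tokens, two of which are tagged as planning tokens
logp = torch.randn(6, requires_grad=True)
adv = torch.randn(6)
plan_mask = torch.tensor([1, 0, 0, 1, 0, 0])
print(hierarchy_aware_loss(logp, adv, plan_mask))
```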
CoreWeave Launches First Publicly Available Serverless Reinforcement Learning Capability to Build Reliable AI Agents
Businesswire· 2025-10-08 17:00
Core Idea
- CoreWeave, Inc. has launched Serverless RL, a fully managed reinforcement learning capability that simplifies the training of AI agents [1]

Product Features
- Serverless RL allows seamless scaling to dozens of GPUs, enhancing the training process for AI agents [1]
- The service requires only a Weights & Biases account and API key to initiate, lowering the entry barrier for developers [1]
- It provides faster feedback loops, improving the efficiency of AI training [1]
X @TechCrunch
TechCrunch· 2025-10-05 15:05
AI tasks that work well with reinforcement learning are getting better fast — and threatening to leave the rest of the industry behind. https://t.co/lFT3lyvg4o ...
Anthropic CEO: AGI Is Marketing
Alex Kantrowitz· 2025-09-30 16:58
Terminology Analysis
- The company views terms like AGI (Artificial General Intelligence) and super intelligence as potentially meaningless and more akin to marketing terms [1][2]
- The company publicly avoids using "AGI" and "super intelligence," and is critical of their use [2]

AI Development & Scaling
- The company is bullish on the rapid improvement of AI capabilities, emphasizing the exponential progress in the field [3]
- AI model improvement occurs every few months due to increased investment in compute, data, and new training models [3]
- AI model training involves pre-training (feeding in data from the internet) and a second stage involving reinforcement learning [4]
- Both pre-training and reinforcement learning are scaling up together, with no apparent barriers to further scaling [5]