Reinforcement Learning
Emergent Behavior in Autonomous Driving with Wayve CEO Alex Kendall
Sequoia Capital· 2025-11-18 17:01
Reasoning in the physical world can be really well expressed as a world model. In 2018, we put our very first world model approach on the road. It was a very small 100,000-parameter neural network that could simulate a 30x3 pixel image of a road in front of us. But we were able to use it as this internal simulator to train a model-based reinforcement learning algorithm. Fast forward to today and we've developed GAIA. It's a full generative world model that's able to simulate multiple cameras and sensors and v ...
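For context, model-based RL with a learned world model typically uses the model as an internal simulator: the policy is rolled out inside the model on imagined trajectories and improved on the predicted returns. The following is a minimal, generic sketch of that idea only; `world_model`, `policy`, and their methods are placeholders, not Wayve's architecture.

```python
import torch

def imagined_rollout(world_model, policy, obs, horizon=15):
    """Roll the policy forward inside the learned world model and sum predicted rewards."""
    state = world_model.encode(obs)                      # encode a real observation into a latent state
    total_reward = torch.zeros(())
    for _ in range(horizon):
        action = policy(state)                           # act from the imagined latent state
        state, reward = world_model.step(state, action)  # model predicts next latent state and reward
        total_reward = total_reward + reward
    return total_reward                                  # objective for policy improvement or planning
```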
RLinf Adds πRL: Online Reinforcement Learning Fine-Tuning for π0 and π0.5
机器之心· 2025-11-06 08:58
Core Insights
- The article discusses advances in robotics, focusing on the VLA models π0 and π0.5 developed by Physical Intelligence, which use flow matching to generate high-dimensional, smooth continuous action sequences and show significant advantages in complex manipulation tasks [2][3].

Group 1: VLA Models and Challenges
- VLA models rely heavily on large-scale, high-quality human demonstration data, which is costly and time-consuming to collect and annotate [2].
- Reinforcement learning (RL) lets agents explore and improve iteratively through real interactions with the environment, reducing the dependence on extensive data and raising the performance ceiling beyond supervised fine-tuning (SFT) [2].

Group 2: πRL Framework
- A collaborative effort from institutions including Tsinghua University, Peking University, and CMU has led to the πRL framework for online reinforcement learning fine-tuning of flow matching VLA models [3].
- The πRL framework achieved average success rates of 97.6% for π0 and 98.3% for π0.5 on the LIBERO testing platform, validating the effectiveness of the fine-tuning approach [3].

Group 3: Technical Innovations
- πRL introduces two technical routes, Flow-Noise and Flow-SDE, to address the challenge of directly computing the log-likelihood of output actions in flow matching VLAs [8][10] (see the sketch after this summary).
- Flow-Noise models the denoising process as a discrete Markov process, enabling direct computation of the joint probability density of the denoised sequence [10].
- Flow-SDE couples the denoising process with environmental interaction, constructing a two-layer Markov Decision Process (MDP) [20].

Group 4: Performance Improvements
- The πRL framework demonstrated a success rate increase of over 40% across 4,352 grasp-and-place task combinations, achieving final success rates exceeding 80% [3][24].
- On the LIBERO testing platform, πRL improved the average success rate of π0 from 57.6% to 97.6% and of π0.5 from 77.1% to 98.3%, surpassing flow matching VLAs trained on the full dataset [19].

Group 5: Generalization and Robustness
- The πRL algorithm significantly enhances the generalization of both models to new environments, as evidenced by tests involving domain randomization [26].
- The framework also reduces the average number of steps required to complete tasks, indicating improved efficiency compared to supervised fine-tuning [28].

Group 6: Future Directions
- Future developments of πRL will include more benchmark tests, deeper analysis of out-of-distribution (OOD) generalization, and further exploration of critic design for improved stability [35][36].
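To make the Flow-Noise idea above concrete, here is a minimal, hedged sketch (not the πRL implementation): if each denoising step is treated as a Gaussian transition around the flow-matching update, the log-likelihood of a sampled action chunk is the sum of per-step Gaussian log-probabilities, which a PPO- or GRPO-style objective can then reweight by a task-success advantage. `denoise_mean`, `K`, and `sigma` are placeholder names.

```python
import torch

def sample_action_with_logprob(denoise_mean, obs, act_dim=7, K=10, sigma=0.1):
    """Run K stochastic denoising steps and accumulate the action's log-probability."""
    x = torch.randn(act_dim)                        # start from Gaussian noise
    logprob = torch.zeros(())
    for k in range(K):
        t = torch.tensor(k / K)
        mean = denoise_mean(x, obs, t)              # deterministic flow-matching update (placeholder)
        dist = torch.distributions.Normal(mean, sigma)
        x = dist.rsample()                          # injected noise makes the step stochastic
        logprob = logprob + dist.log_prob(x).sum()  # per-step Gaussian log-density
    return x, logprob                               # action chunk and its log-likelihood
```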
X @Nick Szabo
Nick Szabo· 2025-11-06 05:37
RT TuringPost (@TheTuringPost) .@karpathy's nanochat is bigger than you think. He calls it a ramp, but it's actually a lab of its own – a miniature system where anyone can experiment. And most importantly – it’s deeply connected to education, allowing us to understand machine intelligence through a tiny model: 1. What is nanochat and how can you use it? It's a miniature LM that costs anything from $100 (~4 hours on an 8XH100 node) to train and behaves like a small, curious creature. Karpathy described it as a “kind ...
X @Demis Hassabis
Demis Hassabis· 2025-10-30 17:44
RT Tom Zahavy (@TZahavy) I am excited to share work we did in the Discovery team at @GoogleDeepMind using RL and generative models to discover creative chess puzzles 🔊♟️♟️ #neurips2025 🎨 While strong chess players intuitively recognize the beauty of a position, articulating the precise elements that constitute creativity remains elusive. To address this, we pre-trained generative models on public datasets and then applied reinforcement learning, using novel rewards designed for uniqueness, counter-intuitiven ...
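As a rough illustration of the setup the tweet describes (pretrain a generative model, then fine-tune it with RL against creativity-oriented rewards), here is a hedged sketch with placeholder reward terms; it does not reproduce the paper's actual reward design.

```python
def puzzle_reward(position, is_legal_fn, uniqueness_fn, counterintuitive_fn,
                  w_unique=1.0, w_counter=1.0):
    """Scalar reward for a generated chess position: zero if illegal, otherwise a
    weighted mix of uniqueness and counter-intuitiveness scores (placeholder functions)."""
    if not is_legal_fn(position):
        return 0.0
    return w_unique * uniqueness_fn(position) + w_counter * counterintuitive_fn(position)
```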
X @Wu Blockchain
Wu Blockchain· 2025-10-30 12:03
Avalon Labs released a whitepaper introducing an on-chain AI-powered RWA marketplace and the AI-MaaS concept. They launched a Reinforcement Learning model on BNB Chain and a new RWA tokenization standard, CRT, granting legal commercial rights. The project is backed by YZi Labs and Framework Ventures. https://t.co/1FOoXvX9Od ...
The next ‘golden age’ of AI investment
Fortune· 2025-10-30 10:48
Core Insights
- The recent Fortune Global Forum in Riyadh highlighted discussions on the transformative impact of artificial intelligence across various industries, featuring prominent speakers from major companies [1]
- Anjney Midha from Andreessen Horowitz identified a new "golden age" of investment opportunities in AI, driven by the emergence of innovative frontier teams [2]
- Midha emphasized the significance of reasoning models in AI, which enhance problem-solving capabilities by mimicking logical reasoning and reflection [3]
- The potential of reinforcement learning in creating multibillion-dollar companies was discussed, particularly when startups deeply understand industry-specific challenges [4]
- Despite concerns about a potential AI bubble, investment in the sector continues to surge, with significant funding levels reported [5]

Investment Trends
- Venture capital investment in generative AI is projected to exceed $73.6 billion in 2025, more than doubling from the previous year, with total investment in the AI ecosystem reaching $110.17 billion, an eightfold increase since 2019 [6]
- Major foundation model providers like OpenAI, Anthropic, and Mistral AI are attracting substantial funding, with OpenAI securing $40 billion, Anthropic $13 billion, and Mistral €1.7 billion [7]

Industry Developments
- The Cyber 60 list, ranking promising cybersecurity startups, showcases new entrants developing tools to combat AI threats, alongside established companies expanding their customer bases [8]
Cursor Releases Its First Coding Model: Code Generation at 250 Tokens/s, Reinforcement Learning + MoE Architecture
量子位· 2025-10-30 01:06
Core Insights
- Cursor has officially released its first in-house coding model, named Composer, as part of the Cursor 2.0 update [1][2]
- Composer is reported to complete complex tasks in just 30 seconds, achieving a speed increase of 400% compared to competitors [3][12]

Model Features
- The new Cursor 2.0 includes a native browser tool that allows the model to test, debug, and iterate code autonomously until achieving correct results [4]
- Voice code generation enables users to convert their thoughts into code without typing [5]
- The interface has shifted from a file-centric to an agent-centric model, allowing multiple agents to run simultaneously without interference [6][7]

Performance Metrics
- Composer generates code at a speed of 250 tokens per second, approximately twice as fast as current leading models like GPT-5 and Claude Sonnet 4.5 [19][20]
- The model demonstrates enhanced reasoning and task generalization capabilities, comparable to mid-tier leading models [21]

Training Methodology
- Composer's performance is attributed to reinforcement learning, which allows the model to learn from real programming tasks rather than static datasets [22][26]
- The training process involves the model working directly within a complete codebase, utilizing production-level tools to write, test, and debug code (see the sketch after this summary) [27][28]

Practical Application
- Cursor 2.0 is designed to provide a practical AI system that aligns closely with developers' daily workflows, enhancing its usability in real-world scenarios [35][36]
- The model has shown emergent behaviors, such as running unit tests and autonomously fixing code format errors [31]

Transparency and Model Origin
- There are concerns regarding the transparency of Composer's foundational model, with questions about whether it is based on pre-existing models or entirely self-trained [37][40]
- Cursor has previously developed an internal model named Cheetah, which was used for testing speed and system integration [42]
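A hedged sketch of the kind of training loop described under "Training Methodology" above (not Cursor's actual implementation): an agent proposes a code edit inside a real repository, the test suite runs, and the pass/fail outcome becomes the reward. `agent.propose_edit` is a hypothetical interface, and the reward could just as well be the fraction of tests passed.

```python
import subprocess

def apply_patch(repo_path, diff_text):
    """Apply a unified diff to the repository via `git apply` (assumes git is available)."""
    subprocess.run(["git", "apply", "-"], cwd=repo_path, input=diff_text, text=True, check=True)

def run_episode(agent, repo_path, task_prompt):
    """One RL episode: propose an edit, apply it, and score it with the repo's tests."""
    diff = agent.propose_edit(repo_path, task_prompt)   # hypothetical agent API returning a diff
    apply_patch(repo_path, diff)
    result = subprocess.run(["pytest", "-q"], cwd=repo_path, capture_output=True, text=True)
    reward = 1.0 if result.returncode == 0 else 0.0     # binary reward: did the test suite pass?
    return diff, reward
```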
Grasp Anything From a Single Demonstration: Peking University Team's Breakthrough in General-Purpose Grasping Works Across All Dexterous-Hand Embodiments
36Kr· 2025-10-29 08:55
Core Insights
- The article discusses the introduction of the DemoGrasp framework, a novel approach to robotic grasping that addresses challenges in traditional reinforcement learning (RL) methods, particularly high-dimensional action spaces and complex reward functions [1][4][6]

Group 1: Framework Overview
- DemoGrasp is designed to enhance the efficiency of grasping tasks by using a single successful demonstration trajectory as a starting point and editing that trajectory to adapt to various objects and poses [4][8]
- The framework transforms the multi-step Markov Decision Process (MDP) into a single-step MDP based on trajectory editing, significantly improving learning efficiency and performance transfer to real robots (see the sketch after this summary) [4][6]

Group 2: Learning Process
- The learning process edits the trajectory of a successful grasp to accommodate new objects, adjusting wrist and finger positions to fit unseen items [8][12]
- DemoGrasp employs a simulation environment with thousands of parallel worlds to train the policy network, achieving over a 90% success rate after 24 hours of training on a single RTX 4090 GPU [8][10]

Group 3: Performance Metrics
- In experiments on the DexGraspNet dataset, DemoGrasp outperformed existing methods, achieving a visual policy success rate of 92% with only a 1% generalization gap between training and testing datasets [10][13]
- The framework demonstrated adaptability across various robot embodiments, achieving an average success rate of 84.6% on 175 different objects without adjusting training hyperparameters [14][15]

Group 4: Real-World Application
- In real-world tests, DemoGrasp successfully grasped 110 unseen objects, with a success rate exceeding 90% for regular-sized items and 70% for challenging flat and small objects [15][16]
- The framework supports complex grasping tasks in cluttered environments, maintaining an 84% success rate for single-instance real-world grasps despite significant variations in lighting and object placement [16][17]
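A hedged sketch of the single-step, trajectory-editing formulation described above (not the DemoGrasp code): the policy maps an observation of the new object to one edit vector applied to the stored demonstration, the episode ends immediately, and the reward is grasp success in simulation. All names and the simple additive edit are illustrative placeholders.

```python
import numpy as np

def rollout_edited_demo(policy, demo_traj, obs, simulate_grasp):
    """Single-step MDP: one action = one edit of the demonstration trajectory."""
    edit = np.asarray(policy(obs))                 # e.g. [dx, dy, dz, droll, dpitch, dyaw, *finger_deltas]
    wrist_offset, finger_deltas = edit[:6], edit[6:]
    edited = [
        (np.asarray(wrist) + wrist_offset, np.asarray(fingers) + finger_deltas)
        for wrist, fingers in demo_traj            # demo_traj: list of (wrist_pose, finger_joints)
    ]
    reward = float(simulate_grasp(edited))         # 1.0 if the edited trajectory lifts the object
    return edited, reward
```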
A 3B Image-Captioning Powerhouse Arrives, Rivaling Qwen2.5-VL-72B in Performance
机器之心· 2025-10-28 04:31
Core Insights
- The article introduces a new technology in Dense Image Captioning called CapRL (Captioning Reinforcement Learning), which successfully applies reinforcement learning methods to image captioning tasks, redefining the reward system based on practicality [2][6][10]
- The CapRL-3B model achieves captioning performance comparable to Qwen2.5-VL-72B, marking a significant advancement in the field of image captioning and providing important insights for applying GRPO strategies to open tasks [2][12]

Summary by Sections

Introduction to CapRL
- CapRL is a novel approach that addresses the challenge of designing rewards for subjective image description tasks by defining objective verifiable rewards based on practicality (see the sketch after this summary) [6][10]
- The model has been trained to generate high-quality captions that improve upon previous methods, avoiding issues like reward hacking [8][10]

Limitations of Existing Methods
- Most current image captioning models rely on supervised fine-tuning (SFT), which has limitations such as high costs and lack of generalization due to dependence on large, manually annotated datasets [7][8]
- The subjective nature of image descriptions complicates the design of reliable reward functions, leading to potential issues in model training [7][8]

CapRL Framework
- The CapRL framework employs a two-stage decoupled training process where a language model answers visual questions based on generated captions, using the accuracy of these answers as an objective reward signal [10][13]
- This innovative approach significantly enhances the quality of generated captions, improving accuracy and detail coverage while reducing hallucinations [10][11]

Experimental Results
- The CapRL-3B model was evaluated on the CapRL-5M dataset, showing significant performance improvements across 12 benchmark tests compared to previous models like ShareGPT4V and DenseFusion [12][14]
- In direct assessments of caption quality, CapRL-3B's performance is comparable to that of larger models, demonstrating an average improvement of 8.4% over baseline models [12][15]

Conclusion and Future Work
- The CapRL framework has been open-sourced, with ongoing iterations to enhance its capabilities, inviting further use and exploration by the community [12][19]
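A hedged sketch of the practicality-based reward described above (assumed interfaces, not the CapRL code): the captioner writes a caption, a text-only model answers visual questions from that caption alone, and QA accuracy becomes the verifiable reward used for a GRPO-style update. `captioner`, `qa_model`, and the prompt format are placeholders.

```python
def caption_reward(image, qa_pairs, captioner, qa_model):
    """Reward = fraction of visual questions answered correctly from the caption alone."""
    caption = captioner(image)                                    # candidate dense caption
    correct = 0
    for question, gold_answer in qa_pairs:
        prompt = f"Passage: {caption}\nQuestion: {question}\nAnswer:"
        prediction = qa_model(prompt)                             # text-only model never sees the image
        correct += int(prediction.strip().lower() == gold_answer.strip().lower())
    return correct / max(len(qa_pairs), 1)                        # verifiable reward in [0, 1]
```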
DeepMind Lands in Nature Again: An AI Agent Built the Strongest RL Algorithm
36Kr· 2025-10-28 00:35
Core Insights
- The main objective of artificial intelligence (AI) is to design agents capable of autonomously predicting, acting, and achieving goals in complex environments. The challenge has been to enable these agents to independently develop efficient reinforcement learning (RL) algorithms [1][2]

Group 1: Discovery Methodology
- Google DeepMind introduced a method called DiscoRL, which allows agents to autonomously discover RL rules through interactions in various environments. This method outperformed existing RL algorithms in both known and challenging benchmark tests [1][2]
- The discovery process involves two types of optimization: agent optimization and meta-optimization. Agents optimize their parameters by updating their strategies and predictions, while the meta-network optimizes the goals of the RL rules to maximize cumulative rewards [3][5]

Group 2: Performance Evaluation
- DiscoRL was evaluated using the interquartile mean (IQM) as a performance metric (see the sketch after this summary), demonstrating superior performance over existing RL algorithms like MuZero and Dreamer in the Atari benchmark tests [7][8]
- The Disco57 rule, trained on 57 Atari games, achieved an IQM score of 13.86, surpassing all current RL rules and showing significant efficiency improvements over MuZero [8][14]

Group 3: Generalization and Robustness
- The generalization capability of Disco57 was tested across 16 independent benchmark tests, outperforming all published methods, including MuZero and PPO. It also showed competitive performance in the Crafter benchmark and ranked third in the NetHack NeurIPS 2021 challenge without using domain-specific knowledge [9][11]
- Disco103, discovered in 103 environments, demonstrated comparable performance to Disco57 in Atari benchmarks and reached human-level performance in the Crafter benchmark, indicating that more complex and diverse environments lead to stronger and more generalizable RL rules [11][14]

Group 4: Efficiency and Scalability
- The optimal performance of Disco57 was achieved within approximately 600 million steps per game, significantly more efficient than traditional human-designed RL rules, which require more experimental iterations and time [14][18]
- The performance of the discovered RL rules improved with the increase in the number of training environments, suggesting that the effectiveness of the discovered RL is dependent on the data (environments) and computational resources available [14][17]
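For reference, the interquartile mean (IQM) mentioned above is the mean of per-run scores after discarding the bottom and top 25%, a robust alternative to the plain mean for aggregating benchmark results. A minimal sketch:

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: average of the middle 50% of scores."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(s)
    lo, hi = int(np.floor(0.25 * n)), int(np.ceil(0.75 * n))
    return float(s[lo:hi].mean())

print(iqm([1, 2, 3, 4, 100]))  # 3.0 -- the outlier 100 does not dominate the aggregate
```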