Reinforcement Learning
From a Contrastive Learning Perspective, Is GRPO Just DPO?
自动驾驶之心· 2025-10-18 16:03
Core Insights
- The article discusses the development of efficient GRPO (Group Relative Policy Optimization) and its implications for reinforcement learning, highlighting the challenges and breakthroughs encountered during the research process [1][2]

Group 1: Research Development
- The initial focus was on improving the speed of GRPO, with an emphasis on sampling efficiency, a common bottleneck in reinforcement learning [2][3]
- The author experimented with tree-based sampling methods but found they did not yield the expected efficiency gains [3]
- A second approach, "speculative sampling," aimed to exit early once a correct sample was obtained, but implementation challenges hindered its performance [3][4]

Group 2: Methodological Innovations
- The third approach used historical data to estimate the probability that a prompt would be answered correctly, leading to a more efficient, Bayesian sampling strategy [4]
- Experiments showed that reducing the number of rollouts per prompt did not significantly hurt performance, indicating the methodology is robust [4][5]
- Exploring contrastive learning principles yielded insights into the relationship between DPO (Direct Preference Optimization) and GRPO, suggesting avenues for further research [5]

Group 3: Community and Collaboration
- The article emphasizes the importance of community engagement in advancing research, highlighting the role of discussion and collaboration in refining ideas and methodologies [8][10]
- A comprehensive community focused on large-model technologies has been established to facilitate knowledge sharing and collaboration across academic research and practical applications [9][10]
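The group-based sampling the article builds on can be illustrated with GRPO's group-relative advantage: the rewards of several rollouts for the same prompt are standardized against that group's own mean and standard deviation, so no learned value network is needed. A minimal sketch under that standard formulation (the function name and the toy rewards are illustrative, not from the article):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: standardize each rollout's reward against
    the mean and std of its own group (rollouts of the same prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# 4 rollouts of one prompt: two correct (reward 1), two wrong (reward 0).
adv = group_relative_advantages(np.array([1.0, 1.0, 0.0, 0.0]))
print(adv)  # correct samples get positive advantage, wrong ones negative
```

Note how an all-correct or all-wrong group yields near-zero advantages everywhere, which is exactly why sampling efficiency per prompt (the article's main concern) matters: such groups contribute no learning signal.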
[Sequoia: AI Is at Least a $10 Trillion Annual Opportunity] Five AI Trends and the New Division of Labor with Humans
老徐抓AI趋势· 2025-10-18 13:24
Core Insights
- Sequoia Capital emphasizes that AI is not merely a software revolution but a labor revolution, targeting the $10 trillion labor market rather than the $650 billion software market [2][8]
- The historical context of software development shows that AI is creating new markets, much as SaaS transformed the software industry [5][7]

AI as a Labor Revolution
- AI aims to replace certain labor functions rather than just enhance software capabilities, focusing on sectors such as customer service, administration, sales, financial analysis, and education [8]
- AI currently automates less than 0.2% of the U.S. service industry, indicating significant room for growth [8]

Comparison with Historical Innovations
- The AI revolution is likened to the Industrial Revolution, where the true impact came from the establishment of factory systems rather than the invention of the steam engine [10][11]
- Building AI infrastructure, akin to the assembly line in manufacturing, is crucial for widespread adoption and efficiency [12]

Future Trends in AI
- Sequoia identifies five key trends for AI: enhancing efficiency while accepting uncertainty, the rise of reinforcement learning, the integration of AI into the physical world, the shift of productivity metrics toward computational power, and the need for companies to adapt to these changes [13][14]
- Demand for computational power is expected to increase dramatically, creating new opportunities for infrastructure providers [14]

Implications for Businesses and Individuals
- Companies that can effectively utilize AI will have a competitive edge, while those that fail to adapt risk obsolescence [14]
- The future workforce will be smaller and more efficient, focused on collaboration with AI rather than traditional labor roles [12][14]
Karpathy: Reinforcement Learning Is Terrible, but Everything Else Is Worse
量子位· 2025-10-18 09:30
Group 1
- The core viewpoint of the article is that achieving Artificial General Intelligence (AGI) will take at least another decade, as current AI systems need significant improvements to reach their full potential [5][10][28]
- Karpathy emphasizes that existing AI systems lack maturity, multimodal capabilities, and the ability to learn continuously, all of which are essential for effective collaboration with humans [8][9][10]
- He critiques the current state of Large Language Models (LLMs), stating that they have cognitive deficiencies and overestimate their own capabilities, requiring substantial enhancements [16][18]

Group 2
- Karpathy argues that reinforcement learning is more flawed than commonly perceived: it reinforces every step taken toward a correct answer, regardless of each step's validity, leading to inefficient learning [20][21][23]
- He believes AGI will not produce a sudden leap in productivity but will follow a gradual growth pattern, similar to the roughly 2% GDP growth trend observed with the internet [25][29]
- The lengthy development of autonomous driving technology is attributed to the high stakes involved, where even minor errors can have severe consequences, necessitating extensive reliability improvements [30][32][33]

Group 3
- As a full-time educator, Karpathy aims to build a leading-edge educational institution offering a unique mentorship experience, focused on personalized learning and advanced AI education [34][36]
- He highlights the importance of tailored teaching methods, which current LLMs cannot replicate, emphasizing the need for human instructors to set appropriately challenging tasks for students [36][38]
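Karpathy's objection in Group 2 can be made concrete: outcome-based reward assigns the single final reward to every step of the trajectory, so a lucky or invalid intermediate step is reinforced exactly as strongly as a sound one. A toy sketch of that credit-assignment scheme (the function and numbers are illustrative, not Karpathy's code):

```python
def outcome_based_returns(num_steps: int, final_reward: float) -> list[float]:
    """Outcome-only credit assignment: every step in a trajectory
    receives the same return as the final outcome, so flawed
    intermediate steps get reinforced whenever the answer is correct."""
    return [final_reward] * num_steps

# A 5-step solution that happened to reach the right answer: all five
# steps, including any invalid ones, receive the full reward signal.
returns = outcome_based_returns(5, 1.0)
print(returns)  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

Per-step (process) reward models are the usual remedy discussed in the literature, but they require step-level supervision that outcome rewards avoid.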
Stable Training, High Data Efficiency: Tsinghua University Proposes SAC Flow, a New Reinforcement Learning Method for "Flow Policies"
机器之心· 2025-10-18 05:44
Core Insights
- The article introduces a new scheme for training flow-based policies with SAC (Soft Actor-Critic), a highly data-efficient reinforcement learning algorithm, optimizing real flow policies end-to-end without surrogate objectives or policy distillation [2][10]

Group 1: Research Background
- Flow-based policies have gained popularity in robot learning for their ability to model multi-modal action distributions and their simplicity relative to diffusion policies, leading to wide adoption in advanced VLA models [4]
- Previous attempts to train flow policies with on-policy RL algorithms have faced challenges; data-efficient off-policy methods like SAC often become unstable due to gradient explosion during multi-step sampling [4][5]

Group 2: Methodology
- The proposed approach treats training a flow policy as equivalent to training a recurrent neural network (RNN), allowing modern recurrent structures such as GRU and Transformer to stabilize training [7][11]
- SAC Flow injects Gaussian noise with drift correction at each rollout step so that the final action distribution is unchanged, allowing the SAC actor/critic losses to be expressed via the log-likelihood of the flow policy's multi-step sampling [15]

Group 3: Training Paradigms
- Two training paradigms are supported: from-scratch training for dense-reward tasks, where SAC Flow can be trained directly [16], and offline-to-online training for sparse-reward tasks, where pre-training on a dataset is followed by online fine-tuning [19]

Group 4: Experimental Results
- In experiments, both Flow-G and Flow-T achieved state-of-the-art performance in the MuJoCo environment, demonstrating stability and high sample efficiency [22][24]
- The results indicate that SAC Flow is robust to the number of sampling steps (K), maintaining stable training across a range of K values, with Flow-T showing particularly strong robustness [30]

Group 5: Comparison with Similar Works
- Unlike FQL/QC-FQL, which distill flow policies into single-step models before off-policy RL training, SAC Flow retains the full modeling capability of flow policies without distillation [33]
- SAC Flow-T and Flow-G converged faster and reached higher final returns across various environments than diffusion-policy baselines and other flow-based methods [34][35]

Group 6: Conclusion
- The key attributes of SAC Flow are serialization, stable training, and data efficiency, leveraging GRU and Transformer structures to stabilize gradient backpropagation [37]
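The RNN analogy in Group 2 comes from the shape of multi-step flow sampling: the action is a "hidden state" updated K times by the same velocity network, with Gaussian noise injected at each step. A schematic numpy sketch of that rollout, using a stand-in linear velocity field (all names, shapes, and the noise scale here are assumptions for illustration; this is not the paper's implementation and omits its drift correction):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACTION_DIM = 8, 4  # toy dimensions, not from the paper

def velocity(obs, a, t, W):
    """Stand-in velocity field v(obs, a, t): one tanh layer."""
    x = np.concatenate([obs, a, np.full((obs.shape[0], 1), t)], axis=1)
    return np.tanh(x @ W)

def flow_policy_sample(obs, W, K=8, noise_std=0.1):
    """K-step Euler rollout of the flow. Each step acts like an RNN
    cell over the action 'hidden state', with per-step Gaussian noise
    (schematic version of the stochastic sampling described above)."""
    a = rng.standard_normal((obs.shape[0], ACTION_DIM))  # Gaussian prior at t=0
    dt = 1.0 / K
    for k in range(K):
        a = a + velocity(obs, a, k * dt, W) * dt                # drift step
        a = a + noise_std * np.sqrt(dt) * rng.standard_normal(a.shape)  # noise
    return a

W = rng.standard_normal((OBS_DIM + ACTION_DIM + 1, ACTION_DIM)) * 0.1
obs = rng.standard_normal((2, OBS_DIM))
actions = flow_policy_sample(obs, W)
print(actions.shape)  # (2, 4)
```

The gradient-explosion problem the article mentions is visible in this structure: backpropagating through all K applications of `velocity` is exactly backpropagation through time, which is why GRU/Transformer-style stabilization carries over.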
Andrej Karpathy Fires Away: Agents Are Just for Show, Reinforcement Learning Is Terrible, and AGI Is Still a Decade Out
机器之心· 2025-10-18 05:44
Core Viewpoint
- AI is projected to contribute an annual GDP increase of about 2%, but the industry's current state is criticized as overly optimistic and disconnected from reality [2][5]

Group 1: AGI and Learning
- AGI is expected to take about ten years to develop, as current AI agents lack the necessary cognitive abilities and continuous-learning capabilities [9][11]
- Current AI models, particularly large language models (LLMs), exhibit cognitive deficiencies that hinder their performance [34][36]
- Reinforcement learning is deemed inadequate for replicating human learning, as it oversimplifies the complexity of human decision-making [44][46]

Group 2: AI Development and Challenges
- The industry is in a phase of rapid development, but there is skepticism about the actual capabilities of AI models, which are often overhyped [5][41]
- Current AI agents struggle to understand and integrate unique coding implementations, leading to inefficiencies and misunderstandings in code generation [36][41]
- The reliance on pre-trained models and the limitations of current AI tools highlight the need for further advances in AI technology [20][42]

Group 3: Future of AI
- The future of AI is expected to involve more sophisticated attention mechanisms and potentially a shift toward more efficient learning algorithms [29][30]
- While AI will continue to evolve, it will still rely on foundational principles such as gradient descent for training large neural networks [29][30]
- Ongoing improvements in AI tools and models suggest a continuous integration of new techniques and methodologies to enhance performance [42][43]
VLA Can Bring Smarter Scenario Applications to Reinforcement Learning...
具身智能之心· 2025-10-17 04:01
Core Insights
- The article discusses the importance of reinforcement learning (RL) in developing embodied intelligent robots, highlighting its applications in complex tasks such as stair climbing, running, and dancing [3][9]
- It emphasizes the challenges newcomers face in reinforcement learning, particularly in producing quality research papers, given the complexity and breadth of the subject [6][10]
- To address these challenges, a specialized 1v6 mentoring course in reinforcement learning has been introduced, aimed at helping students produce publishable research papers [7][10]

Group 1: Reinforcement Learning Applications
- Reinforcement learning is crucial for gait control in humanoid and quadruped robots, enabling them to perform tasks in challenging environments [3][9]
- The VLA+RL approach for robotic arms is gaining popularity in academia, improving the efficiency and smoothness of robotic operations [4][9]

Group 2: Course Structure and Objectives
- The 1v6 mentoring course is designed for graduate students and others needing guidance on research papers, featuring weekly live sessions and dedicated teaching assistants [8][10]
- The course spans 14 weeks of intensive online training followed by 8 weeks of maintenance support, covering idea confirmation, project implementation, and writing refinement [10][18]

Group 3: Course Content and Deliverables
- The curriculum includes reinforcement learning fundamentals, simulation environments, and writing guidance, with the goal of producing a research paper suitable for top conferences and journals [10][19]
- Students receive structured templates and support for the writing and submission process, helping them meet the standards of leading academic publications [10][29]

Group 4: Instructor and Support
- The course is led by experienced instructors with backgrounds in embodied intelligence and robotics, providing both theoretical knowledge and practical insights [27]
- Continuous support is offered through a dedicated WeChat group for real-time Q&A, enhancing the learning experience [18][27]
How Are Industry and Academia Tackling End-to-End and VLA?
自动驾驶之心· 2025-10-17 00:03
Core Insights
- The article discusses the evolution of end-to-end algorithms in autonomous driving, highlighting the transition from modular production algorithms to end-to-end approaches and now to Vision-Language-Action (VLA) models [1][3]
- It emphasizes the rich technology stack behind end-to-end algorithms, including BEV perception, vision-language models (VLM), diffusion models, reinforcement learning, and world models [3]

Summary by Sections

End-to-End Algorithms
- End-to-end algorithms fall into two main paradigms, single-stage and two-stage, with UniAD representing the single-stage approach [1]
- Single-stage methods branch further into various subfields, particularly VLA-based ones, which have seen a surge in publications and industrial applications in recent years [1]

Courses Offered
- The article promotes two courses, "End-to-End and VLA Autonomous Driving Small Class" and "Practical Course on Autonomous Driving VLA and Large Models," aimed at helping individuals enter the field quickly and efficiently [3]
- The practical course focuses on VLA, covering topics from VLM as an autonomous-driving interpreter to modular and integrated VLA, along with detailed theoretical foundations [3][12]

Instructor Team
- The instructor team includes experts from academia and industry, with backgrounds in multi-modal perception, autonomous-driving VLA, and large-model frameworks [8][11][14]
- Notable instructors have published numerous papers at top-tier conferences and have extensive research and practical experience in autonomous driving and large models [8][11][14]

Target Audience
- The courses target individuals with a foundational understanding of autonomous driving who are familiar with the basic modules and have knowledge of transformer models, reinforcement learning, and BEV perception [15][17]
Course Starting Soon! Sharing a Full-Stack Learning Roadmap for Autonomous Driving VLA
自动驾驶之心· 2025-10-15 23:33
Core Insights
- The focus of academia and industry has shifted toward VLA (Vision-Language-Action) in autonomous driving, which gives vehicle decision-making human-like reasoning capabilities [1][4]
- Traditional perception and lane-detection methods have matured and now draw less attention, while VLA has become a critical development area for major autonomous-driving companies [4][6]

Summary by Sections

Introduction to VLA
- VLA is categorized into modular VLA, integrated VLA, and reasoning-enhanced VLA, all essential for improving the reliability and safety of autonomous driving [1][4]

Course Overview
- A comprehensive course on autonomous-driving VLA has been designed, spanning foundational principles to practical applications, including cutting-edge techniques such as CoT, MoE, RAG, and reinforcement learning [6][12]

Course Structure
- The course consists of six chapters: an introduction to VLA algorithms, foundational algorithms, VLM as an interpreter, modular and integrated VLA, reasoning-enhanced VLA, and a final project [12][20]

Chapter Highlights
- Chapter 1 provides an overview of VLA algorithms and their development history, along with benchmarks and evaluation metrics [13]
- Chapter 2 covers the foundational knowledge of the Vision, Language, and Action modules, including large-model deployment [14]
- Chapter 3 discusses VLM's role as an interpreter in autonomous driving, covering classic and recent algorithms [15]
- Chapter 4 delves into modular and integrated VLA, emphasizing the evolution of language models in planning and control [16]
- Chapter 5 explores reasoning-enhanced VLA, introducing new modules for decision-making and action generation [17][19]

Learning Outcomes
- The course aims to deepen understanding of VLA's current advancements, core algorithms, and project applications, benefiting participants in internships and job placement [24]
Boston Dynamics' Robot Dog Is Back, with "Five Legs" Working in Concert
36Kr· 2025-10-15 13:02
Core Insights
- Boston Dynamics' Spot robot can lift a 15 kg tire in just 3.7 seconds, showcasing advanced dynamic whole-body manipulation techniques [1][11]
- The robot's performance exceeds traditional static assumptions, coordinating its movements effectively to handle loads beyond its nominal maximum lifting capacity [13]

Group 1: Dynamic Whole-Body Manipulation
- The method combines sampling and learning to let the robot perform tasks requiring coordination of arm, legs, and torso [1][2]
- A hierarchical control approach splits the problem into two layers: low-level control for balance and stability, and high-level control for task-specific strategies [2][14]

Group 2: Control Strategies
- The low-level controller uses reinforcement learning to manage motor torques for stability, while the high-level controller employs sampling-based strategies for tasks like tire alignment and stacking [2][7]
- The sampling controller simulates many future scenarios in parallel to identify the most effective actions for completing the task [3][5]

Group 3: Performance Metrics
- The robot averaged 5.9 seconds per tire, nearly matching human operating speed [11]
- Dynamic coordination allows the robot to handle weights significantly exceeding its peak static lifting capability, expanding its operational range [13][14]

Group 4: Learning and Adaptation
- The training process randomizes object properties to bridge the gap between simulation and the real world [10]
- An asymmetric actor-critic training architecture improves the robot's ability to handle complex dynamics and contact mechanics [8][10]
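The high-level sampling controller described in Group 2 follows a well-known pattern: sample many candidate action sequences, simulate each one forward, and execute the first action of the best sequence before replanning. A random-shooting sketch on a toy 1D system (the dynamics, cost, and all parameters are illustrative stand-ins, not Boston Dynamics' controller):

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_cost(x0, actions, target=1.0):
    """Simulate a toy system x' = x + a and accumulate squared
    distance to the target (stand-in for the robot's task cost)."""
    x, cost = x0, 0.0
    for a in actions:
        x = x + a
        cost += (x - target) ** 2
    return cost

def sampling_controller(x0, horizon=5, n_samples=256):
    """Random-shooting MPC sketch: score many simulated futures in
    parallel, then execute only the first action of the cheapest one."""
    candidates = rng.uniform(-0.5, 0.5, size=(n_samples, horizon))
    costs = [rollout_cost(x0, c) for c in candidates]
    best = candidates[int(np.argmin(costs))]
    return best[0]  # apply the first action, then replan next step

a0 = sampling_controller(x0=0.0)
print(a0)  # a positive step toward the target at x = 1.0
```

Replanning at every step is what lets such controllers absorb disturbances, which pairs naturally with the learned low-level balance policy handling fast stabilization underneath.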
After Sutton Declared "LLMs Are a Dead End," a New Interview Reveals AI's Dilemma
机器之心· 2025-10-15 07:33
Core Viewpoint
- The article discusses Rich Sutton's critical perspective on large language models (LLMs), suggesting they may not align with the principles of his essay "The Bitter Lesson" and highlighting their limitations in learning from real-world interaction [1][3][22]

Group 1: Limitations of LLMs
- Sutton argues that LLMs have significant flaws, particularly their inability to learn from ongoing interaction with the environment [3][21]
- He emphasizes that true intelligence should emerge from continuous reinforcement learning through dynamic interaction, rather than from extensive pre-training and supervised fine-tuning [3][4][22]
- The reliance on human knowledge and data may limit LLMs' scalability and cause them to fall short of expectations, as they are fundamentally constrained by the biases present in their training data [24][25][26]

Group 2: Alternative Perspectives on Intelligence
- Experts in the discussion, including Suzanne Gildert and Niamh Gavin, express skepticism about achieving pure reinforcement learning, noting that current systems often revert to imitation learning because universal reward functions are hard to define [7][11]
- The conversation highlights the need for systems that can learn autonomously in new environments, much as a squirrel learns to hide nuts, rather than relying solely on pre-existing data [8][10]
- There is consensus that while LLMs exhibit impressive capabilities, they do not equate to true intelligence, since they cannot effectively explore and learn from their environment [33][35]

Group 3: The Future of AI Development
- The article suggests the AI field is at a crossroads, where the dominance of certain paradigms may hinder innovation and lead to a cycle of self-limitation [28][29]
- Sutton warns that the current trajectory of LLMs, heavily reliant on human imitation, may not yield the breakthroughs needed for genuine understanding and reasoning [22][24]
- The discussion points to a shift toward more robust learning mechanisms that prioritize experience and exploration over mere data absorption [28][30]