Reinforcement Learning
Explainer: Deconstructing large-model post-training, and the origins and evolution of GRPO and its successors
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article traces the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm for large language models and reinforcement learning, highlighting its advantages and limitations relative to earlier methods such as Proximal Policy Optimization (PPO) [4][38].

Summary by Sections

Development of Large Language Models
- The rapid advancement of large language models has spawned a variety of post-training methods, with GRPO standing out as a notable innovation in the reinforcement learning paradigm [3][5].

Post-Training and Reinforcement Learning
- Post-training is crucial for refining a model's capabilities in specific domains, improving adaptability and flexibility to meet diverse application needs [12][11].
- Reinforcement learning, particularly reinforcement learning from human feedback (RLHF), plays a central role in the post-training phase, optimizing model outputs according to human preferences [14][19].

GRPO and Its Advantages
- GRPO eliminates the separate critic (value) model, cutting memory and computational costs significantly compared to PPO's dual-network setup [30][35].
- Instead of a learned value baseline, GRPO scores each response against the average reward of a group of responses sampled for the same prompt, simplifying the training process [34][35] (a minimal sketch of this group-relative advantage follows this summary).

Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory footprint and training speed, making it a more efficient choice for large language model training [37].
- Despite these advantages, GRPO still exhibits stability issues similar to PPO's, particularly in smaller-scale reinforcement learning tasks [39].

Recent Innovations: DAPO, GSPO, and GFPO
- DAPO extends GRPO with techniques such as Clip-Higher and dynamic sampling to address practical issues encountered during training [41][42].
- GSPO shifts importance sampling from the token level to the sequence level, significantly improving training stability [48][49].
- GFPO optimizes multiple response attributes simultaneously, addressing GRPO's limitations around scalar feedback and multi-round reasoning tasks [61][63].

Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, traces a clear trajectory in optimizing large language models, with GRPO serving as a pivot for further advances in the field [81][82].
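To make the group-relative baseline concrete, here is a minimal, illustrative sketch (not the article's or any reference implementation) of how GRPO-style advantages can be computed from a group of responses sampled for one prompt; the reward values and the mean/std normalization choice are assumptions for illustration.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: score each sampled response against the mean
    (and std) of rewards for the same prompt, replacing PPO's learned critic."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 responses sampled for one prompt, scored by a reward model or verifier.
rewards = [1.0, 0.0, 0.5, 1.0]          # e.g. correctness of each response
advantages = grpo_advantages(rewards)    # every token of response i shares advantages[i]
print(advantages)
```

In GSPO's sequence-level variant, the importance ratio used alongside such advantages would be computed once per response rather than per token, which is the stability change the summary above describes.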
RLinf, the first large-scale reinforcement learning framework built for embodied intelligence, open-sourced by Tsinghua, Beijing Zhongguancun College, Infinigence AI (无问芯穹), and others
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article covers the launch of RLinf, a large-scale reinforcement learning framework designed for embodied intelligence, emphasizing a flexible, scalable architecture that unifies training, rendering, and inference [5][7].

Group 1: Development of the RL Framework
- The shift in artificial intelligence from "perception" to "action" underscores the importance of embodied intelligence, which is drawing growing attention in both academia and industry [2][4].
- RLinf is developed jointly by Tsinghua University, Beijing Zhongguancun College, and Infinigence AI (无问芯穹), aiming to address the limitations of existing frameworks in supporting embodied intelligence [5][7].

Group 2: Features of RLinf
- RLinf's architecture consists of six layers, namely user, task, execution, scheduling, communication, and hardware, enabling a hybrid execution mode that achieves over a 120% system speedup [7][12].
- The framework introduces a Macro-to-Micro Flow (M2Flow) mechanism, enabling flexible construction of training pipelines while preserving programming flexibility and ease of debugging [14][15].

Group 3: Execution Modes
- RLinf supports three execution modes, Collocated Mode, Disaggregated Mode, and Hybrid Mode, letting users place components for optimal resource utilization [19][20]; a toy placement sketch follows this summary.
- The framework integrates low-intrusion, multi-backend support to serve the diverse needs of researchers in embodied intelligence [16][20].

Group 4: Communication and Scheduling
- RLinf provides an adaptive communication library designed for reinforcement learning, optimizing data exchange between components to improve system efficiency [22][28].
- An automated scheduling module minimizes resource idling by profiling component performance and selecting the best execution mode, markedly improving training stability [24][25].

Group 5: Performance Metrics
- RLinf shows superior performance on embodied intelligence tasks, with more than a 120% efficiency improvement over existing frameworks in specific tests [27][33].
- Trained models show large success-rate gains across tasks, reaching up to 97.3% success in specific scenarios [31][35].

Group 6: Future Development and Community Engagement
- The RLinf team emphasizes open-source principles, providing comprehensive documentation and support to improve the user experience and foster collaboration [40][41].
- The team is actively recruiting for multiple roles to further develop and maintain RLinf, and invites community engagement and feedback [42][43].
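The sketch below is not RLinf's actual API; it only illustrates, under assumed names, the difference between collocated, disaggregated, and hybrid placement of the training (actor) and rollout components that the execution-mode description refers to.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    actor_gpus: list      # GPUs running policy training
    rollout_gpus: list    # GPUs running environment rendering / inference

def make_placement(mode: str, gpus: list) -> Placement:
    """Toy illustration of the three placement strategies described for RLinf."""
    if mode == "collocated":        # every component shares all GPUs, time-sliced
        return Placement(actor_gpus=gpus, rollout_gpus=gpus)
    if mode == "disaggregated":     # components get disjoint GPU pools, run concurrently
        half = len(gpus) // 2
        return Placement(actor_gpus=gpus[:half], rollout_gpus=gpus[half:])
    if mode == "hybrid":            # mix: some GPUs shared, others dedicated to one role
        return Placement(actor_gpus=gpus[: len(gpus) * 3 // 4],
                         rollout_gpus=gpus[len(gpus) // 4:])
    raise ValueError(f"unknown mode: {mode}")

print(make_placement("hybrid", [0, 1, 2, 3]))
```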
R-Zero deep dive: How can AI self-evolve without human data?
机器之心· 2025-08-31 03:54
Core Viewpoint
- The article analyzes the R-Zero framework, which lets AI models self-evolve from "zero data" through the co-evolution of two roles, a Challenger and a Solver, aiming to overcome the dependence of traditional large language models on extensive human-annotated data [2][3].

Group 1: R-Zero Framework Overview
- R-Zero is designed to let AI generate its own learning tasks and improve its reasoning capabilities without human intervention [11].
- The framework consists of two independent but cooperating agents: a Challenger (Qθ) and a Solver (Sϕ) [6].
- The Challenger acts as a curriculum generator, creating tasks at the edge of the Solver's current capabilities and focusing on tasks with high information gain [6].

Group 2: Iterative Process
- In each iteration, the Challenger is trained against a frozen Solver to generate questions that maximize the Solver's uncertainty [8].
- After each iteration, the strengthened Solver becomes the new target for the Challenger's training, producing a spiral increase in both agents' capabilities [9].

Group 3: Implementation and Results
- Pseudo-labels are produced with a self-consistency strategy: the Solver generates multiple candidate answers per question and the most frequent answer is taken as the pseudo-label [17]; a small sketch of this step follows this summary.
- A filtering mechanism retains only questions whose empirical accuracy falls within a specified band, improving the quality of the training signal [18].
- Experiments show significant gains in reasoning: the Qwen3-8B-Base model's average score on mathematical benchmarks rises from 49.18 to 54.69 after three iterations (+5.51) [18].

Group 4: Generalization and Efficiency
- The model generalizes well: average scores on general reasoning benchmarks such as MMLU-Pro and SuperGPQA improve by 3.81 points, suggesting stronger core reasoning rather than rote memorization of specific knowledge [19].
- R-Zero can serve as an efficient intermediate training stage, maximizing the value of human-annotated data used in subsequent fine-tuning [22].

Group 5: Challenges and Limitations
- A key challenge is the declining accuracy of pseudo-labels, which drops from 79.0% in the first iteration to 63.0% in the third, indicating noisier supervision as task difficulty rises [26].
- The framework's reliance on domains with objective, verifiable answers limits its applicability in areas with subjective evaluation criteria, such as creative writing [26].
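A minimal sketch of the self-consistency pseudo-labeling and accuracy-band filtering steps described above; the `solver` callable, the sample count, and the 30-70% band are illustrative assumptions, not R-Zero's published hyperparameters.

```python
from collections import Counter

def pseudo_label(solver, question, n_samples=8):
    """Self-consistency: sample several answers and take the majority vote."""
    answers = [solver(question) for _ in range(n_samples)]
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / n_samples          # pseudo-label and its empirical agreement rate

def filter_questions(solver, questions, lo=0.3, hi=0.7):
    """Keep only questions the Solver answers consistently at an intermediate rate:
    questions that are too easy or too hard carry little learning signal."""
    kept = []
    for q in questions:
        label, acc = pseudo_label(solver, q)
        if lo <= acc <= hi:
            kept.append((q, label))
    return kept
```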
Boston Dynamics' robot dog finally has a new trick! Engineers: we didn't expect it to pull it off either
机器人大讲堂· 2025-08-30 14:59
Core Viewpoint
- Boston Dynamics' Spot robot has demonstrated impressive new capabilities, including backflips, highlighting its advanced engineering and its potential applications across industries [1][3][5].

Group 1: Technical Achievements
- Spot can perform multiple backflips and other complex movements, showing agility comparable to a gymnast's [3][5].
- The engineering team, led by Arun Kumar, initially doubted that Spot could backflip at all, underscoring the experimental nature of the project [5].
- The training is not merely for show: it aims to ensure Spot can recover quickly from falls while carrying heavy loads in industrial settings [8][10].

Group 2: Training and Development Process
- Development proceeds through iterative testing in simulation before successful movements are deployed to the physical robot [11].
- The team uses reinforcement learning to push Spot's performance, reaching speeds above 5.2 meters per second, more than three times the default controller's maximum [13].

Group 3: Practical Applications
- Since its commercial launch in 2020, Spot has been used in a range of industrial applications, including surveying at Ford factories and safety inspections at Kia [14][17].
- Spot has also performed radiation surveys for Dominion Energy and automated inspections at Chevron facilities, demonstrating its versatility across environments [16][17].

Group 4: Public Perception and Engagement
- Public performances, such as appearances on "America's Got Talent," aim to reframe robots as engaging and beneficial rather than threatening [20][22].
- Deployments for unusual tasks, such as delivering pizza for Domino's, illustrate Spot's adaptability and the breadth of potential applications [18].
After a year out of the spotlight, Kimi's Yang Zhilin in his latest conversation: "Standing at the beginning of infinity"
创业邦· 2025-08-30 03:19
Core Viewpoint
- The article recounts a conversation with Moonshot AI founder Yang Zhilin on the evolution of AI, focusing on the Kimi K2 model, its remaining technical challenges, and the philosophical implications of problem-solving in AI development [4][5][12].

Group 1: Kimi K2 Model Development
- Kimi K2, built on a MoE architecture, represents a significant advance, released as an open-source model able to program and interact with the digital world [4][5].
- Its release in July 2025 marked a return to public attention for Moonshot AI after a period of relative silence from founder Yang Zhilin [4][5].
- Development shifted from a pre-training plus supervised fine-tuning pipeline toward pre-training plus reinforcement learning, which significantly changed how the company operates [27][28].

Group 2: Philosophical Insights
- Yang Zhilin argues that human civilization is a continuous process of conquering problems and expanding the boundary of knowledge, drawing on David Deutsch's book "The Beginning of Infinity" [5][12].
- The idea that every solved problem creates new questions is central to ongoing AI development, implying an infinite journey of exploration and innovation [5][12].

Group 3: Technical Innovations
- K2 aims to maximize token efficiency, so the model learns more from the same amount of data, which is crucial given the slow growth of high-quality data [29][30].
- The Muon optimizer significantly improves token efficiency, letting the model learn from data more effectively than traditional optimizers such as Adam [30][31]; an illustrative sketch of a Muon-style update follows this summary.
- The model's ability to carry out complex tasks over long horizons without human intervention is a notable advance, pointing toward end-to-end automation in AI applications [17][44].

Group 4: Agentic Capabilities
- K2 is characterized as an agentic model, capable of multi-turn interactions and of using tools to connect with the external world, which strengthens its problem-solving ability [43][44].
- Multi-agent systems are highlighted as a way to improve task execution and collaboration among agents, enabling more complex problem solving [22][44].
- Generalization remains a challenge for agent models, with ongoing work on adapting them to varied tasks and environments [34][46].
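As context for the Muon discussion, here is a minimal sketch of a Muon-style update as commonly described in open-source write-ups: momentum on a 2D weight's gradient, orthogonalized with a Newton-Schulz iteration. The polynomial coefficients, momentum handling, and learning rate are assumptions taken from those public descriptions, not Moonshot AI's implementation, and real variants add shape-dependent scaling and other details.

```python
import torch

def newton_schulz_orthogonalize(m, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    """Approximately map a matrix to the nearest orthogonal one (the UV^T of its SVD)
    via an iterative matrix polynomial; coefficients follow common Muon write-ups."""
    x = m / (m.norm() + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        a_mat = x @ x.T
        x = a * x + (b * a_mat + c * a_mat @ a_mat) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a 2D weight matrix: accumulate momentum, then
    orthogonalize the momentum buffer and step against it."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.add_(update, alpha=-lr)
    return weight, momentum

w = torch.randn(64, 32)
m = torch.zeros_like(w)
w, m = muon_step(w, torch.randn_like(w), m)
```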
Sequoia US: The five AI tracks we are focusing on over the next year
Founder Park· 2025-08-29 12:19
Core Viewpoint
- Sequoia Capital argues that the AI revolution will be a transformation comparable to the Industrial Revolution, representing a $10 trillion opportunity in the service industry, of which only about $20 billion has so far been automated by AI [2][11].

Investment Themes
- Over the next 12-18 months, Sequoia will focus on five key investment themes: persistent memory, communication protocols, AI voice, AI security, and open-source AI [2][30].

Historical Context
- The article draws parallels between the current AI revolution and milestones of the Industrial Revolution, stressing the role of specialization in building complex systems [5][7][10].

Market Potential
- The U.S. service industry is valued at $10 trillion, with only about $20 billion currently affected by AI, indicating enormous room for growth [11][13].

Investment Trends
- Five observed investment trends:
  1. Leverage over certainty, where AI agents can significantly increase productivity despite some residual uncertainty [21].
  2. Real-world validation of AI capabilities, moving beyond academic benchmarks [23].
  3. The practical application of reinforcement learning in industry [25].
  4. AI's integration into the physical world, enhancing processes and hardware [27].
  5. Computing becoming a new productivity function, with knowledge workers' computational needs expected to rise dramatically [29].

Focus Areas for Investment
- Persistent memory is crucial for AI to integrate deeply into business processes, and remains an open challenge [31].
- Seamless communication protocols are needed for AI agents to collaborate effectively, analogous to the TCP/IP standard in the internet revolution [34].
- AI voice technology is maturing, with applications in both consumer and enterprise settings [36][37].
- AI security represents a significant opportunity across the development and consumer-usage spectrum [39].
- Open-source AI is at a critical juncture, with the potential to compete with proprietary models and foster a more open future [41].
That's a Chinese robot for you: its table tennis is seriously impressive
量子位· 2025-08-29 11:37
Core Viewpoint
- The article covers advances in humanoid robots, focusing on a table tennis robot developed by Tsinghua University students that demonstrates high-level table tennis skills through a combination of hierarchical planning and reinforcement learning [7][8].

Group 1: Robot Performance
- The robot responds with a reaction time of 0.42 seconds and has sustained up to 106 consecutive hits in a rally [3][5][23].
- In real-world tests it returned 24 of 26 balls, achieving a hitting rate of 96.2% and a return rate of 92.3% [21].

Group 2: Technical Framework
- The team proposes a hierarchical framework separating high-level planning from low-level control, so the robot can predict ball trajectories and execute human-like strokes [9][11]; a toy trajectory-prediction sketch follows this summary.
- A model-based planner predicts the ball's position, speed, and timing, while a reinforcement-learning-based controller generates coordinated whole-body movements [10][16].

Group 3: Training Methodology
- The robot was trained on a standard table tennis setup, with its hand modified to act as a paddle [13].
- Training incorporated human motion references to encourage human-like swinging actions [18][19].

Group 4: Challenges in Robotics
- Table tennis is a demanding benchmark for robots because perception, prediction, planning, and execution must all happen within a very short time window [29][30].
- The sport requires agile full-body movement, including fast arm swings, waist rotation, and balance recovery, making it a complex task for humanoid robots [32][33].
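To illustrate the kind of quantity a model-based planner in such a hierarchy computes, here is a toy sketch that predicts when and where the ball crosses the robot's hitting plane under simple ballistic motion; it ignores spin, drag, and table bounces, all of which the planner described in the article would have to handle, and the numbers are invented for illustration.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # m/s^2

def predict_interception(p0, v0, hit_plane_x, dt=0.002, horizon=1.0):
    """Integrate a simple ballistic model until the ball reaches the hitting plane.
    Returns (time_to_hit, predicted_position, predicted_velocity), or None if it
    never crosses the plane within the planning horizon."""
    p, v, t = np.array(p0, float), np.array(v0, float), 0.0
    while t < horizon:
        p = p + v * dt
        v = v + GRAVITY * dt
        t += dt
        if p[0] >= hit_plane_x:        # ball has crossed the robot's hitting plane
            return t, p, v
    return None

# Ball launched toward the robot: hitting plane 1.5 m away, 2 m/s incoming speed.
result = predict_interception(p0=[0.0, 0.0, 0.9], v0=[2.0, 0.1, 0.5], hit_plane_x=1.5)
print(result)
```

The low-level reinforcement-learned controller would then be conditioned on this predicted interception point and time to generate the swing.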
Saining Xie recalls his OpenAI interview seven years ago: whiteboard coding and a five-hour meeting that ran until after dark
机器之心· 2025-08-29 09:53
Core Insights
- The article recounts the distinctive interview experiences of AI researchers at major tech companies, highlighting differences in interview style and in each company's research focus [1][9][20].

Group 1: Interview Experiences
- Lucas Beyer, a researcher with extensive experience at top AI firms, started a poll about memorable interview experiences at companies such as Google, Meta, and OpenAI [2][20].
- Saining Xie recalled that his interviews at various AI companies were unforgettable, in particular a grueling two-hour marathon at DeepMind involving more than 100 math and machine learning questions [5][6].
- The interview process at Meta was described as more academic, centered on discussions with prominent researchers rather than coding alone [6][7].

Group 2: Company-Specific Insights
- The Google Research interview resembled an academic job interview, emphasizing research discussions over coding challenges [7].
- OpenAI's process featured a lengthy session on a reinforcement learning problem, reflecting the company's deep research engagement [8][9].
- The interview questions mirror each company's research priorities, such as Meta's focus on computer vision and OpenAI's emphasis on reinforcement learning [9][20].

Group 3: Notable Interviewers and Candidates
- Figures such as John Schulman and Noam Shazeer appeared as interviewers, indicating the caliber of talent involved in hiring at these firms [7][9].
- Candidates shared memorable moments, such as solving complex problems on napkins or engaging in deep discussions about research topics [19][20].
Quadruped robot dog + single arm: start your embodied-learning journey on a budget
具身智能之心· 2025-08-29 04:00
Core Viewpoint
- Xdog is a low-cost, multifunctional development platform combining a quadruped robot dog and a robotic arm, designed for embodied-intelligence developers and accompanied by a comprehensive curriculum for research and learning in robotics [1][2].

Group 1: Hardware Overview
- Xdog integrates the robot dog and arm with functions such as voice control, sim2real, real2sim, target recognition and tracking, autonomous grasping, and reinforcement-learning gait control [2][5].
- The robot dog measures 25 cm × 20 cm × 30 cm, weighs 7.0 kg, and reaches a top speed of 7.2 km/h with a maximum rotation speed of 450 degrees per second [3][11].
- The main control chip is an Allwinner H616 with a quad-core 1.6 GHz CPU, 4 GB RAM, and 32 GB storage [4][5].

Group 2: Technical Specifications
- Battery capacity is 93.24 Wh, giving roughly 120 minutes of operation and about 6 hours of standby [5][11].
- The arm reaches a maximum height of 0.85 m and can grasp within 0.4 m of its base [7].
- The depth camera uses active dual infrared plus structured light, outputting depth at 1280 × 800 @ 30 fps over a working range of 0.2 m-10 m [14].

Group 3: Software and Functionality
- Supported control methods include voice control, keyboard control, visual control, and reinforcement learning for autonomous movement [15][17]; a minimal teleoperation sketch follows this summary.
- Development is based on ROS1 with Python as the primary programming language; a GPU of at least an RTX 2080 Ti is recommended for inference [16][24].
- Advanced functions include coordinated control of the arm and dog for target following, as well as autonomous grasping [19][20].

Group 4: Educational Curriculum
- The curriculum includes hands-on training in ROS project creation, MuJoCo simulation, and reinforcement learning principles, among other topics [22][23].
- Courses cover setup and use of the Xdog system, including network configuration, camera parameter tuning, and advanced algorithms for object recognition and tracking [22][23].
- The teaching team consists of experienced instructors responsible for project management, technical support, and algorithm training [22].

Group 5: Delivery and Support
- Delivery is completed within three weeks of payment, with a one-year after-sales warranty [25][26].
- The product ships as hardware plus accompanying courses; returns or exchanges are not accepted for non-quality issues [26].
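Since the platform is ROS1- and Python-based, a keyboard-style control loop typically reduces to publishing velocity commands. The sketch below follows common ROS conventions (a `/cmd_vel` topic carrying `geometry_msgs/Twist`); the topic name and message interface are assumptions for illustration, since Xdog's actual topics are not specified here.

```python
#!/usr/bin/env python
# Minimal ROS1 (rospy) sketch of velocity control for a quadruped platform.
import rospy
from geometry_msgs.msg import Twist

def send_velocity(pub, forward=0.3, turn=0.0, duration=2.0, rate_hz=10):
    """Publish a constant body velocity for `duration` seconds, then stop."""
    cmd = Twist()
    cmd.linear.x = forward      # m/s forward
    cmd.angular.z = turn        # rad/s yaw
    rate = rospy.Rate(rate_hz)
    end = rospy.Time.now() + rospy.Duration(duration)
    while rospy.Time.now() < end and not rospy.is_shutdown():
        pub.publish(cmd)
        rate.sleep()
    pub.publish(Twist())        # zero command stops the robot

if __name__ == "__main__":
    rospy.init_node("xdog_teleop_sketch")
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)  # assumed topic name
    rospy.sleep(0.5)            # give the publisher time to connect
    send_velocity(pub, forward=0.3, turn=0.2)
```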
Trajectory planning based on deep reinforcement learning
自动驾驶之心· 2025-08-28 23:32
Core Viewpoint
- The article surveys the progress and potential of reinforcement learning (RL) in autonomous driving, tracing its evolution and comparing it with other learning paradigms such as supervised learning and imitation learning [4][7][8].

Summary by Sections

Background
- The industry has recently focused on new technical paradigms such as VLA and reinforcement learning, with interest in RL growing after milestones such as AlphaZero and ChatGPT [4].

Supervised Learning
- In autonomous driving, perception tasks such as object detection are framed as supervised learning: a model is trained on labeled data to map inputs to outputs [5].

Imitation Learning
- Imitation learning trains models to replicate actions from observed behavior, much as a child learns from adults; it is the primary learning objective in end-to-end autonomous driving [6].

Reinforcement Learning
- Reinforcement learning differs from imitation learning in that it learns through interaction with the environment, using feedback on task outcomes to optimize the model; it is particularly suited to sequential decision-making in autonomous driving [7].

Inverse Reinforcement Learning
- Inverse reinforcement learning addresses the difficulty of hand-designing reward functions for complex tasks by learning a reward model from user feedback, which then guides training of the main model [8].

Basic Concepts of Reinforcement Learning
- Key concepts include policies, rewards, and value functions, which are essential to understanding how RL operates in autonomous driving contexts [14][15][16].

Markov Decision Process
- The Markov decision process is presented as the framework for modeling sequential tasks, applicable to a range of autonomous driving scenarios [10].

Common Algorithms
- Foundational algorithms include dynamic programming, Monte Carlo methods, and temporal-difference learning [26][30].

Policy Optimization
- On-policy and off-policy algorithms are contrasted, each with its own trade-offs in training stability and data utilization [27][28].

Advanced Reinforcement Learning Techniques
- Techniques such as DQN, TRPO, and PPO are introduced, showing how they improve training stability and efficiency [41][55]; a minimal sketch of PPO's clipped objective follows this summary.

Application in Autonomous Driving
- Reward design and closed-loop training are emphasized: the vehicle's actions influence the environment, which calls for sophisticated modeling techniques [60][61].

Conclusion
- Reinforcement learning algorithms and their application to autonomous driving are developing rapidly, and readers are encouraged to engage with the technology hands-on [62].
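As a concrete reference for the PPO discussion above, here is a minimal sketch of the clipped surrogate objective that PPO maximizes; the batch values, clip range, and use of plain NumPy are illustrative choices, not the article's implementation.

```python
import numpy as np

def ppo_clipped_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate: limit how far the new policy's probability ratio
    can move from the old policy for each sampled action."""
    ratio = np.exp(log_probs_new - log_probs_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic (minimum) term so overly large policy updates are not rewarded.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy batch: three sampled actions with their advantages.
obj = ppo_clipped_objective(
    log_probs_new=np.array([-0.9, -1.2, -0.3]),
    log_probs_old=np.array([-1.0, -1.0, -0.5]),
    advantages=np.array([1.5, -0.7, 0.4]),
)
print(obj)  # maximize this (or minimize its negative) by gradient ascent on the policy
```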