Reinforcement Learning
Explainer: Deconstructing LLM Post-Training, and the Past and Present of GRPO and Its Successors
36Kr · 2025-09-01 04:38
Group 1
- The core concept of the article revolves around the evolution of post-training methods in large language models, particularly the GRPO algorithm as a significant advancement in reinforcement learning paradigms [2][46].
- GRPO has emerged as a universal reinforcement learning algorithm applicable to a wide range of post-training tasks, with notable improvements over previous methods like PPO [2][48].
- The article discusses the importance of post-training in enhancing the adaptability and flexibility of models, addressing the limitations of pre-training alone [5][46].
Group 2
- The article highlights the transition from PPO to GRPO, emphasizing the reduction in computational cost and memory requirements that makes GRPO the more efficient alternative [18][14].
- GRPO's methodology uses the mean reward of a group of sampled responses as the baseline for advantage estimation, eliminating the need for a separate value function [16][14].
- Despite its advantages, GRPO still faces stability issues, prompting further research and improved algorithms such as DAPO and GSPO [19][48].
Group 3
- DAPO, developed by ByteDance and Tsinghua AIR, builds upon GRPO with enhancements such as Clip-Higher and dynamic sampling to improve training efficiency [20][21].
- GSPO represents a significant advancement by shifting the focus from token-level to sequence-level importance sampling, which enhances training stability [28][30].
- GFPO addresses GRPO's limitations by allowing simultaneous optimization of multiple response attributes, improving overall model performance [33][34].
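The token-level versus sequence-level distinction drawn between GRPO/PPO and GSPO in Group 3 can be sketched in a few lines. This is a toy illustration only; the function names and the length normalization are our assumptions, not the papers' exact definitions:

```python
import math

def token_level_ratios(logp_new, logp_old):
    """PPO/GRPO-style: one importance ratio per token,
    exp(logp_new - logp_old) at each position."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    """GSPO-style: a single, length-normalized ratio for the whole
    response, which the article credits with more stable training."""
    n = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / n)

# Two tokens whose per-token ratios diverge in opposite directions
# can still yield a neutral sequence-level ratio:
print(token_level_ratios([-1.0, -2.0], [-1.5, -1.5]))
print(sequence_level_ratio([-1.0, -2.0], [-1.5, -1.5]))  # 1.0
```

The example shows why sequence-level clipping behaves differently: noisy per-token ratios can cancel out at the sequence level instead of each triggering a clip.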
RLinf Open-Sourced! The First Large-Scale Reinforcement Learning Framework for Embodied Intelligence That Unifies Rendering, Training, and Inference
具身智能之心· 2025-09-01 04:02
Core Viewpoint
- The article discusses the launch of RLinf, a large-scale reinforcement learning framework aimed at embodied intelligence, highlighting its innovative design and capabilities in advancing AI's transition from perception to action [2][5].
Group 1: Framework Overview
- RLinf is a flexible and scalable framework designed for embodied intelligence, integrating various components to optimize performance [5].
- The "inf" in the framework's name signifies both "infrastructure" and "infinite" scaling, emphasizing its adaptable system design [7].
- RLinf features a hybrid execution model that achieves over 120% system speedup compared to traditional frameworks, with VLA model performance improvements of 40%-60% [7][12].
Group 2: Execution Modes
- RLinf supports three execution modes: Collocated, Disaggregated, and Hybrid, allowing users to configure components based on their needs [17][15].
- The hybrid mode combines the advantages of shared and separated execution, minimizing system idle time and enhancing efficiency [12][15].
Group 3: Communication and Scheduling
- The framework includes an adaptive communication library designed for reinforcement learning, optimizing data exchange between components [19][22].
- An automated scheduling module minimizes resource idleness and dynamically adapts to user training flows, enabling rapid scaling [23][24].
Group 4: Performance Metrics
- RLinf has demonstrated significant performance improvements in embodied intelligence tasks, achieving success rates of 80%-90% in specific scenarios, compared to 30%-50% for previous models [24][26].
- The framework has also achieved state-of-the-art (SOTA) performance on mathematical reasoning tasks across multiple datasets, showcasing its versatility [29][30].
Group 5: Documentation and Community Engagement
- Comprehensive documentation and API support are provided to enhance user experience and facilitate understanding of the framework [32][34].
- The RLinf team encourages collaboration and invites users to explore the framework, highlighting ongoing recruitment for various research and engineering positions [33][34].
Sequoia US: The Five AI Tracks We Will Focus On in the Coming Year
创业邦· 2025-09-01 03:48
Core Insights
- Sequoia Capital views the AI revolution as a transformative event comparable to the Industrial Revolution, presenting a $10 trillion opportunity in the service industry, of which only $20 billion is currently automated by AI [1][7][13].
Investment Themes
- **Theme 1: Persistent Memory** - Persistent memory involves both long-term memory, so AI can retain shared context, and stable identity for AI agents, so they maintain their unique characteristics over time. This area remains largely unsolved, presenting a significant opportunity [30].
- **Theme 2: Seamless Communication Protocols** - Standardized communication protocols among AI agents are critical for seamless collaboration, much as TCP/IP was during the internet revolution. They could transform business models by allowing AI agents to interact autonomously [32].
- **Theme 3: AI Voice** - AI voice technology is maturing, with improvements in fidelity and latency enabling real-time conversation. Its applications span consumer and enterprise sectors, including logistics and trading [35].
- **Theme 4: AI Security** - There is a substantial opportunity in AI security across the development and consumer spectrum, ensuring technology is developed and used safely. This includes protecting both users and AI agents from vulnerabilities [37].
- **Theme 5: Open Source AI** - Open-source AI is at a pivotal moment, with the potential to compete with proprietary models. This is essential for fostering a more open and accessible AI landscape and broader participation in AI development [40].
Explainer: Deconstructing LLM Post-Training, and the Past and Present of GRPO and Its Successors
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article discusses the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm in the context of large language models and reinforcement learning, highlighting its advantages and limitations compared to previous methods like Proximal Policy Optimization (PPO) [4][38].
Summary by Sections
Development of Large Language Models
- The rapid advancement of large language models has driven the emergence of various post-training methods, with GRPO a notable innovation that enhances reinforcement learning paradigms [3][5].
Post-Training and Reinforcement Learning
- Post-training is crucial for refining models' capabilities in specific domains, enhancing adaptability and flexibility to meet diverse application needs [12][11].
- Reinforcement learning, particularly from human feedback (RLHF), plays a vital role in the post-training phase, optimizing model outputs according to user preferences [14][19].
GRPO and Its Advantages
- GRPO eliminates the need for a separate critic model, significantly reducing memory and computational costs compared to PPO, which requires dual networks [30][35].
- GRPO scores each sampled response against the mean reward of its group, using that baseline to evaluate improvements and thereby simplifying the training process [34][35].
Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory requirements and training speed, making it the more efficient choice for large language model training [37].
- Despite its advantages, GRPO still faces stability issues similar to those of PPO, particularly in smaller-scale reinforcement learning tasks [39].
Recent Innovations: DAPO, GSPO, and GFPO
- DAPO introduces enhancements to GRPO, such as Clip-Higher and dynamic sampling, to address practical challenges encountered during training [41][42].
- GSPO advances the methodology by shifting from token-level to sequence-level importance sampling, significantly improving training stability [48][49].
- GFPO allows simultaneous optimization of multiple response attributes, addressing GRPO's limitations with scalar feedback and multi-round reasoning tasks [61][63].
Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, illustrates a clear trajectory in optimizing large language models, with GRPO serving as a pivot for further advances in the field [81][82].
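The critic-free advantage estimate described above can be sketched in a few lines: each of a group's sampled responses is scored against the group's own reward statistics, so no value network is needed. An illustrative toy assuming per-group reward normalization, not the exact published formulation:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage for one prompt: G responses are sampled,
    each reward is compared to the group mean and scaled by the group
    standard deviation, replacing a learned critic's baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# One prompt, G = 4 responses scored by a reward model:
advs = group_relative_advantages([0.2, 0.9, 0.4, 0.5])
print([round(a, 2) for a in advs])  # → [-1.18, 1.57, -0.39, 0.0]
```

Because the baseline is computed from the group itself, the advantages always sum to zero: above-average responses are reinforced, below-average ones suppressed.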
RLinf, the First Large-Scale Reinforcement Learning Framework Built for Embodied Intelligence, Open-Sourced by Tsinghua University, Beijing Zhongguancun Academy, Infinigence AI (无问芯穹), and Others
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article discusses the launch of RLinf, a large-scale reinforcement learning framework designed for embodied intelligence, emphasizing its flexible and scalable architecture that integrates training, rendering, and inference [5][7].
Group 1: Development of RL Framework
- The transition in artificial intelligence from "perception" to "action" highlights the importance of embodied intelligence, which is gaining attention in both academia and industry [2][4].
- RLinf is developed collaboratively by Tsinghua University, Beijing Zhongguancun Academy, and Infinigence AI (无问芯穹), aiming to address the limitations of existing frameworks in supporting embodied intelligence [5][7].
Group 2: Features of RLinf
- RLinf's architecture consists of six layers: user, task, execution, scheduling, communication, and hardware, enabling a hybrid execution mode that achieves over 120% system speedup [7][12].
- The framework introduces a Macro-to-Micro Flow (M2Flow) mechanism, enabling flexible construction of training processes while maintaining high programming flexibility and ease of debugging [14][15].
Group 3: Execution Modes
- RLinf supports three execution modes: Collocated, Disaggregated, and Hybrid, allowing users to configure components for optimal resource utilization [19][20].
- The framework integrates low-intrusion multi-backend support to cater to the diverse needs of researchers in the embodied intelligence field [16][20].
Group 4: Communication and Scheduling
- RLinf features an adaptive communication library designed for reinforcement learning, optimizing data exchange between components to enhance system efficiency [22][28].
- An automated scheduling module minimizes resource idling by analyzing component performance and selecting the best execution mode, significantly improving training stability [24][25].
Group 5: Performance Metrics
- RLinf demonstrates superior performance on embodied intelligence tasks, achieving over 120% efficiency improvement compared to existing frameworks in specific tests [27][33].
- The framework has shown significant success-rate improvements across tasks, with models reaching up to 97.3% success in specific scenarios [31][35].
Group 6: Future Development and Community Engagement
- The RLinf team emphasizes open-source principles, providing comprehensive documentation and support to enhance user experience and facilitate collaboration [40][41].
- The team is actively recruiting for various positions to further develop and maintain RLinf, inviting community engagement and feedback [42][43].
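As a rough illustration of the three execution modes described above, consider how components might be mapped to GPUs under each. This is a toy sketch; the `Mode` enum and `place` function are hypothetical and not RLinf's actual API:

```python
from enum import Enum

class Mode(Enum):
    COLLOCATED = "collocated"        # all components share every GPU
    DISAGGREGATED = "disaggregated"  # each component owns a disjoint GPU set
    HYBRID = "hybrid"                # rollout side shared, trainer separated

def place(components, gpus, mode):
    """Toy placement of pipeline components (e.g. render/infer/train)
    onto a GPU list under each execution mode."""
    if mode is Mode.COLLOCATED:
        return {c: list(gpus) for c in components}
    if mode is Mode.DISAGGREGATED:
        n = len(gpus) // len(components)
        return {c: gpus[i * n:(i + 1) * n] for i, c in enumerate(components)}
    # Hybrid: collocate all but the last component, give the last its own GPUs
    shared, solo = components[:-1], components[-1]
    split = len(gpus) // 2
    placement = {c: gpus[:split] for c in shared}
    placement[solo] = gpus[split:]
    return placement
```

The hybrid case captures the idea in Group 3 of the first RLinf entry: time-sharing the rollout-side components keeps GPUs busy while the trainer runs on dedicated hardware, reducing idle time relative to a fully disaggregated split.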
R-Zero Deep Dive: How Does AI Self-Evolve Without Human Data?
机器之心· 2025-08-31 03:54
Core Viewpoint
- The article discusses the R-Zero framework, which enables AI models to self-evolve from "zero data" through the co-evolution of two AI roles, a Challenger and a Solver, aiming to overcome the reliance of traditional large language models on extensive human-annotated data [2][3].
Group 1: R-Zero Framework Overview
- R-Zero is designed to let AI self-generate learning tasks and improve reasoning capabilities without human intervention [11].
- The framework consists of two independent yet collaborating agents: the Challenger (Qθ) and the Solver (Sϕ) [6].
- The Challenger acts as a curriculum generator, creating tasks at the edge of the Solver's current capabilities and focusing on tasks with high information gain [6].
Group 2: Iterative Process
- The process is an iterative loop in which the Challenger is trained against a frozen Solver to generate questions that maximize the Solver's uncertainty [8].
- After each iteration, the enhanced Solver becomes the new target for the Challenger's training, leading to a spiral increase in both agents' capabilities [9].
Group 3: Implementation and Results
- The framework generates pseudo-labels through a self-consistency strategy: the Solver produces multiple candidate answers for each question, and the most frequent is selected as the pseudo-label [17].
- A filtering mechanism retains for training only questions whose accuracy falls within a specific range, enhancing the quality of the learning process [18].
- Experimental results show significant improvements in reasoning, with the Qwen3-8B-Base model's average score on mathematical benchmarks rising from 49.18 to 54.69 after three iterations (+5.51) [18].
Group 4: Generalization and Efficiency
- The model demonstrates strong generalization, with average scores on general reasoning benchmarks such as MMLU-Pro and SuperGPQA improving by 3.81 points, indicating enhanced core reasoning abilities rather than mere memorization of specific knowledge [19].
- R-Zero can serve as an efficient intermediate training stage, maximizing the value of human-annotated data when that data is used for subsequent fine-tuning [22].
Group 5: Challenges and Limitations
- A key challenge is declining pseudo-label accuracy, which dropped from 79.0% in the first iteration to 63.0% in the third, indicating increased noise in the supervisory signals as task difficulty rises [26].
- The framework's reliance on domains with objective, verifiable answers limits its applicability in areas with subjective evaluation criteria, such as creative writing [26].
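The self-consistency pseudo-labeling and filtering steps in Group 3 above can be sketched as follows. This is a minimal illustration; the function names and the retention band are assumptions, not the paper's exact values:

```python
from collections import Counter

def pseudo_label(candidate_answers):
    """Self-consistency: the Solver samples several answers to one
    Challenger question; the most frequent answer becomes the
    pseudo-label, and its frequency is an empirical consistency rate."""
    counts = Counter(candidate_answers)
    label, freq = counts.most_common(1)[0]
    return label, freq / len(candidate_answers)

def keep_for_training(candidates, low=0.3, high=0.8):
    """Filter: retain only questions whose consistency rate falls in a
    band; near-zero agreement is noise, near-total agreement means the
    question is too easy to be informative. Band values are illustrative."""
    _, rate = pseudo_label(candidates)
    return low <= rate <= high
```

Usage: `pseudo_label(["42", "42", "17", "42", "9"])` yields `("42", 0.6)`, so the question is kept; five identical answers would yield a rate of 1.0 and be filtered out as uninformative.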
Boston Dynamics' Robot Dog Finally Has a New Trick. Engineers: We Didn't Expect It to Pull It Off Either
机器人大讲堂· 2025-08-30 14:59
Core Viewpoint
- Boston Dynamics' Spot robot has showcased impressive new capabilities, including backflips, highlighting its advanced engineering and potential applications across industries [1][3][5].
Group 1: Technical Achievements
- Spot can perform multiple backflips and other complex movements, demonstrating agility comparable to a gymnast's [3][5].
- The engineering team, led by Arun Kumar, initially doubted the feasibility of Spot performing backflips, indicating the experimental nature of the project [5].
- The training for these movements is not merely for show; it aims to ensure Spot can recover quickly from falls while carrying heavy loads in industrial settings [8][10].
Group 2: Training and Development Process
- The development process involves iterative testing in simulation before successful movements are deployed to the physical robot [11].
- The team uses reinforcement learning to push Spot's performance, achieving speeds over 5.2 meters per second, more than three times the default controller's maximum [13].
Group 3: Practical Applications
- Since its commercial launch in 2020, Spot has been used in various industrial applications, including surveying at Ford factories and safety inspections at Kia [14][17].
- Spot has also performed radiation surveys for Dominion Energy and automated inspections at Chevron facilities, showcasing its versatility across environments [16][17].
Group 4: Public Perception and Engagement
- Public performances, such as those on "America's Got Talent," aim to change perceptions of robots, presenting them as engaging and beneficial rather than threatening [20][22].
- Deployments for unusual tasks, such as delivering pizza for Domino's, illustrate Spot's adaptability and potential for diverse applications [18].
After a Year Out of Sight, Kimi's Yang Zhilin in His Latest Conversation: "Standing at the Beginning of Infinity"
创业邦· 2025-08-30 03:19
Core Viewpoint
- The article discusses the evolution and advancement of AI, focusing on the Kimi K2 model developed by Moonshot AI, and highlights both ongoing challenges and the philosophical implications of problem-solving in AI development [4][5][12].
Group 1: Kimi K2 Model Development
- The Kimi K2 model, based on the MoE architecture, represents a significant advancement in AI, enabling open-source programming and interaction with the digital world [4][5].
- The model's release in July 2025 marked a return to public attention for Moonshot AI after a period of relative silence from its founder, Yang Zhilin [4][5].
- Development involved a shift from pre-training plus supervised fine-tuning to pre-training plus reinforcement learning, which significantly changed how the company operates [27][28].
Group 2: Philosophical Insights
- Yang Zhilin emphasizes that human civilization is a continuous process of conquering problems and expanding the boundaries of knowledge, drawing inspiration from David Deutsch's book "The Beginning of Infinity" [5][12].
- The notion that every solved problem leads to new questions is central to ongoing AI development, suggesting an infinite journey of exploration and innovation [5][12].
Group 3: Technical Innovations
- The K2 model aims to maximize token efficiency, allowing the model to learn more from the same amount of data, which is crucial given the slow growth of high-quality data [29][30].
- The Muon optimizer significantly improves token efficiency, enabling the model to learn from data more effectively than traditional optimizers like Adam [30][31].
- The model's ability to perform complex tasks over extended periods without human intervention is a notable advancement, showcasing the potential for end-to-end automation in AI applications [17][44].
Group 4: Agentic Capabilities
- K2 is characterized as an agentic model, capable of multi-turn interactions and of using various tools to connect with the external world, enhancing its problem-solving capabilities [43][44].
- Multi-agent systems are highlighted as a way to improve task execution and collaboration among agents, enabling more complex problem-solving [22][44].
- The challenge of generalization in agent models is acknowledged, with ongoing efforts to improve adaptability to varied tasks and environments [34][46].
Sequoia US: The Five AI Tracks We Will Focus On in the Coming Year
Founder Park· 2025-08-29 12:19
Core Viewpoint
- Sequoia Capital believes the AI revolution will be a transformative change comparable to the Industrial Revolution, presenting a $10 trillion opportunity in the service industry, of which only $20 billion is currently automated by AI [2][11].
Investment Themes
- Sequoia will focus on five key investment themes over the next 12-18 months: persistent memory, communication protocols, AI voice, AI security, and open-source AI [2][30].
Historical Context
- The article draws parallels between the current AI revolution and the milestones of the Industrial Revolution, emphasizing the importance of specialization in the development of complex systems [5][7][10].
Market Potential
- The U.S. service industry is valued at $10 trillion, with only $20 billion currently impacted by AI, indicating a massive growth opportunity [11][13].
Investment Trends
- Five observed investment trends:
  1. Leverage over certainty: AI agents can significantly increase productivity despite some uncertainty [21].
  2. Real-world validation of AI capabilities, moving beyond academic benchmarks [23].
  3. Practical application of reinforcement learning in industry [25].
  4. AI's integration into the physical world, enhancing processes and hardware [27].
  5. Computing as a new productivity function, with knowledge workers' computational needs expected to increase dramatically [29].
Focus Areas for Investment
- Persistent memory is crucial for AI to integrate deeply into business processes, with ongoing challenges in this area [31].
- Seamless communication protocols are needed for AI agents to collaborate effectively, similar to the TCP/IP standard in the internet revolution [34].
- AI voice technology is maturing, with applications in consumer and enterprise sectors [36][37].
- AI security presents a significant opportunity across the development and consumer-usage spectrum [39].
- Open-source AI is at a critical juncture, with the potential to compete with proprietary models and foster a more open future [41].
No Wonder It's a Chinese Robot: Its Table Tennis Game Is Superb
量子位· 2025-08-29 11:37
Core Viewpoint
- The article discusses advancements in humanoid robots, focusing on a table tennis robot developed by Tsinghua University students and showcasing the high-level table tennis skills it achieves through a combination of hierarchical planning and reinforcement learning [7][8].
Group 1: Robot Performance
- The robot responds with a reaction time of 0.42 seconds and has achieved a maximum of 106 consecutive hits during a match [3][5][23].
- In real-world tests, the robot successfully returned 24 of 26 balls, achieving a hitting rate of 96.2% and a return rate of 92.3% [21].
Group 2: Technical Framework
- The research team proposed a hierarchical framework that separates high-level planning from low-level control, allowing the robot to predict ball trajectories and execute human-like movements [9][11].
- A model-based planner predicts the ball's position, speed, and timing, while a reinforcement-learning-based controller generates coordinated movements [10][16].
Group 3: Training Methodology
- The robot was trained on a standard table tennis setup, with its hand modified to function as a paddle [13].
- Training incorporated human motion references to encourage the robot to mimic human-like swings [18][19].
Group 4: Challenges in Robotics
- Table tennis is a challenging sport for robots because it demands rapid perception, prediction, planning, and execution within a very short time frame [29][30].
- The sport requires agile full-body movement, including quick arm swings, waist rotation, and balance recovery, making it a complex task for humanoid robots [32][33].
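The planner/controller split in Group 2 above can be illustrated with a minimal model-based prediction step: the high-level planner estimates where and when the ball will cross the robot's hitting plane, and hands that target to the low-level controller. A hypothetical sketch that ignores spin and air drag, not the team's actual model:

```python
def predict_interception(p0, v0, x_plane, g=9.81):
    """Toy model-based planner: given the ball's position p0 = (x, y, z)
    and velocity v0 (meters, m/s), solve simple ballistic flight for the
    time it crosses the hitting plane x = x_plane, and return the
    predicted hit point and hit time for the controller to track."""
    t = (x_plane - p0[0]) / v0[0]                # time to reach the plane
    y = p0[1] + v0[1] * t                        # lateral drift
    z = p0[2] + v0[2] * t - 0.5 * g * t ** 2     # height under gravity
    return (x_plane, y, z), t
```

With a 0.42-second reaction budget, the point of this split is that the cheap analytic prediction runs every frame, while the learned whole-body controller only has to reach the predicted point in time.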