Reinforcement Learning

Sequoia Capital US: The Five AI Tracks We Will Focus On Over the Next Year
创业邦· 2025-09-01 03:48
Core Insights
- Sequoia Capital views the AI revolution as a transformative event comparable to the Industrial Revolution, presenting a $10 trillion opportunity in the service industry, with only $20 billion currently automated by AI [1][7][13].

Investment Themes
- **Theme 1: Persistent Memory** Persistent memory covers both long-term memory, so that AI can retain shared context, and the identity of AI agents, so that they keep their unique characteristics over time. This area remains largely unsolved, presenting a significant opportunity [30].
- **Theme 2: Seamless Communication Protocols** Standardized communication protocols among AI agents are critical for seamless collaboration, much as the TCP/IP protocols were during the internet revolution. This could transform business models by allowing AI agents to interact autonomously [32].
- **Theme 3: AI Voice** AI voice technology is maturing, with improvements in fidelity and latency that enable real-time conversations. Its applications span consumer and enterprise sectors, including logistics and trading [35].
- **Theme 4: AI Security** There is a substantial opportunity in AI security across the spectrum from development to consumer use, ensuring that the technology is developed and used safely. This includes protecting both users and AI agents from vulnerabilities [37].
- **Theme 5: Open Source AI** Open source AI is at a pivotal moment, with the potential to compete with proprietary models. This is essential for a more open and accessible AI landscape that allows broader participation in AI development [40].
An Explainer: Deconstructing LLM Post-Training, the Past and Present of GRPO and Its Successors
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article traces the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm for large language models and reinforcement learning, comparing its advantages and limitations with earlier methods such as Proximal Policy Optimization (PPO) [4][38].

Summary by Sections

Development of Large Language Models
- The rapid advancement of large language models has produced a range of post-training methods, with GRPO a notable innovation in the reinforcement learning paradigm [3][5].

Post-Training and Reinforcement Learning
- Post-training refines a model's capabilities in specific domains, improving its adaptability and flexibility to meet diverse application needs [12][11].
- Reinforcement learning, particularly reinforcement learning from human feedback (RLHF), plays a vital role in the post-training phase, aiming to optimize model outputs according to user preferences [14][19].

GRPO and Its Advantages
- GRPO eliminates the need for a separate critic model, significantly reducing memory and computational costs compared with PPO, which trains both a policy network and a critic network [30][35].
- GRPO samples a group of responses for each prompt and uses that group's own reward statistics as the baseline for judging each response, which simplifies the training process [34][35].

Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory requirements and training speed, making it a more efficient choice for large language model training [37].
- Despite these advantages, GRPO still faces stability issues similar to those of PPO, particularly in smaller-scale reinforcement learning tasks [39].

Recent Innovations: DAPO, GSPO, and GFPO
- DAPO adds enhancements to GRPO, such as Clip-Higher and dynamic sampling, to address practical challenges encountered during training [41][42].
- GSPO shifts importance sampling from the token level to the sequence level, significantly improving training stability [48][49].
- GFPO allows simultaneous optimization of multiple response attributes, addressing limitations of GRPO related to scalar feedback and multi-round reasoning tasks [61][63].

Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, shows a clear trajectory in optimizing large language models, with GRPO serving as a pivotal point for further advances in the field [81][82].
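The critic-free baseline and the token-level versus sequence-level distinction summarized above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the implementation from any of the cited papers: the function names are hypothetical, the population standard deviation is one possible normalization choice, and the clip bounds are illustrative defaults.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response against its own group, with no critic model.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_ratios(new_logps, old_logps):
    """One importance ratio per token (the GRPO-style granularity)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps, old_logps):
    """One length-normalized ratio for the whole response (GSPO-style)."""
    total = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(total / len(new_logps))


def clamp_ratio(ratio, eps_low=0.2, eps_high=0.28):
    """Clamp a ratio to an asymmetric range, in the spirit of Clip-Higher.

    The bounds here are illustrative defaults, not prescribed values.
    """
    return max(1.0 - eps_low, min(1.0 + eps_high, ratio))


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. correct / incorrect answers
    print(group_relative_advantages(rewards))
    new_lp, old_lp = [-1.0, -2.0], [-1.5, -2.5]
    print(token_level_ratios(new_lp, old_lp))
    print(clamp_ratio(sequence_level_ratio(new_lp, old_lp)))
```

Because the baseline is computed from the group itself, responses that beat the group average get positive advantages and the rest get negative ones, which is what removes the need for a learned value function.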
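RLinf, the First Large-Scale Reinforcement Learning Framework Built for Embodied Intelligence! Open-Sourced by Tsinghua, Beijing Zhongguancun Academy, Infinigence AI, and Others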
机器之心· 2025-09-01 02:49
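Report by the 机器之心 editorial team

Tsinghua University, Beijing Zhongguancun Academy, and Infinigence AI, together with Peking University, UC Berkeley, and other institutions, have open-sourced RLinf: the first large-scale reinforcement learning framework for embodied intelligence that integrates rendering, training, and inference.

Artificial intelligence is making a leap from "perception" to "action". Embodied intelligence built on large models is regarded as the next stage of AI and has become a shared focus of academia and industry.

In the large-model field, with the release of the o1/R1 series of reasoning models, the focus of model training has gradually shifted from data-driven pre-training and post-training to reward-driven reinforcement learning (RL). OpenAI predicts that the compute required for reinforcement learning will even exceed that of pre-training. At the same time, RL infrastructure that can use large-scale compute efficiently is becoming increasingly important, and a number of excellent frameworks have emerged recently, greatly advancing the field.

Figure 1: OpenAI's talk at a Sequoia Capital closed-door meeting

However, current frameworks still offer limited support for embodied intelligence. Compared with reasoning LLMs, which are purely "cerebrum" models, embodied intelligence spans a variety of models: cerebrum models (focused on reasoning and long-horizon planning, such as RoboBrain), cerebellum models (focused on execution and short-horizon manipulation, such as OpenVLA), and combined cerebrum-cerebellum models (fast-slow systems, such as pi 0.5). Second, beyond the multi-step decision-making of Agentic AI, embodied intelligence ...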
R-Zero Deep Dive: How Can AI Evolve Itself Without Human Data?
机器之心· 2025-08-31 03:54
Core Viewpoint
- The article discusses the R-Zero framework, which enables AI models to self-evolve from "zero data" through the co-evolution of two AI roles, Challenger and Solver, aiming to overcome the reliance of traditional large language models on extensive human-annotated data [2][3].

Group 1: R-Zero Framework Overview
- R-Zero is designed to let AI generate its own learning tasks and improve its reasoning capabilities without human intervention [11].
- The framework consists of two independent agents that work together: Challenger (Qθ) and Solver (Sϕ) [6].
- The Challenger acts as a curriculum generator, creating tasks at the edge of the Solver's current capabilities and favoring tasks with high information gain [6].

Group 2: Iterative Process
- In each iteration, the Challenger is trained against the frozen Solver to generate questions that maximize the Solver's uncertainty [8].
- After each iteration, the improved Solver becomes the new target for the Challenger's training, so the capabilities of both agents rise in a spiral [9].

Group 3: Implementation and Results
- The framework generates pseudo-labels through a self-consistency strategy: the Solver produces multiple candidate answers for each question, and the most frequent answer is taken as the pseudo-label [17].
- A filtering mechanism retains only questions whose accuracy falls within a specific range, improving the quality of the training data [18].
- Experimental results show significant improvements in reasoning: the Qwen3-8B-Base model's average score on mathematical benchmarks rose from 49.18 to 54.69 after three iterations (+5.51) [18].

Group 4: Generalization and Efficiency
- The model generalizes well: average scores on general reasoning benchmarks such as MMLU-Pro and SuperGPQA improved by 3.81 points, indicating stronger core reasoning ability rather than memorization of specific knowledge [19].
- R-Zero can serve as an efficient intermediate training stage, maximizing the value of human-annotated data used for subsequent fine-tuning [22].

Group 5: Challenges and Limitations
- A key challenge is the declining accuracy of pseudo-labels, which dropped from 79.0% in the first iteration to 63.0% in the third, indicating noisier supervisory signals as task difficulty rises [26].
- The framework relies on domains with objective, verifiable answers, which limits its applicability in areas with subjective evaluation criteria, such as creative writing [26].
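The self-consistency labeling and accuracy-range filtering described in Group 3 can be sketched in a few lines. This is a minimal illustration, not the R-Zero code: the function names are hypothetical, the 0.3 to 0.8 thresholds are placeholder assumptions (the summary does not state the actual range), and "accuracy" is approximated here by the share of sampled answers that agree with the majority answer.

```python
from collections import Counter


def pseudo_label(answers):
    """Return (majority answer, fraction of samples agreeing with it)."""
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)


def keep_question(agreement, low=0.3, high=0.8):
    """Keep questions that are neither trivially easy nor hopelessly noisy.

    The bounds are hypothetical placeholders, not values from the source.
    """
    return low <= agreement <= high


def build_training_set(samples_by_question, low=0.3, high=0.8):
    """Map each retained question to its pseudo-label."""
    kept = {}
    for question, answers in samples_by_question.items():
        label, agreement = pseudo_label(answers)
        if keep_question(agreement, low, high):
            kept[question] = label
    return kept


if __name__ == "__main__":
    samples = {
        "q_easy": ["4"] * 10,                     # agreement 1.0, filtered out
        "q_mid": ["7"] * 6 + ["9"] * 4,           # agreement 0.6, kept
        "q_noisy": ["a", "b", "c", "d", "e"],     # agreement 0.2, filtered out
    }
    print(build_training_set(samples))
```

The filter also hints at why pseudo-label accuracy can fall over iterations: as questions get harder, the majority answer is more often wrong even when agreement stays inside the retained range.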
Boston Dynamics' Robot Dog Finally Has a New Trick! Engineers: We Didn't Expect It Could Do This Either
机器人大讲堂· 2025-08-30 14:59
Core Viewpoint
- Boston Dynamics' Spot robot has shown impressive new capabilities, including backflips, which highlight its advanced engineering and its potential applications across industries [1][3][5].

Group 1: Technical Achievements
- Spot can perform multiple backflips and other complex movements, demonstrating agility comparable to a gymnast's [3][5].
- The engineering team, led by Arun Kumar, initially doubted that Spot could perform backflips, which underlines the experimental nature of the project [5].
- The training is not merely for show: it aims to ensure that Spot can recover quickly from falls while carrying heavy loads in industrial settings [8][10].

Group 2: Training and Development Process
- Development proceeds through iterative testing in simulation before successful movements are deployed to the physical robot [11].
- The team uses reinforcement learning to improve Spot's performance, reaching speeds over 5.2 meters per second, more than three times the default controller's maximum speed [13].

Group 3: Practical Applications
- Since its commercial launch in 2020, Spot has been used in a range of industrial applications, including surveying at Ford factories and safety inspections at Kia [14][17].
- Spot has also carried out radiation surveys for Dominion Energy and automated inspections at Chevron's facilities, showing its versatility across environments [16][17].

Group 4: Public Perception and Engagement
- Public performances, such as those on "America's Got Talent," aim to change perceptions of robots, presenting them as engaging and beneficial rather than threatening [20][22].
- Deploying Spot for unusual tasks, such as delivering pizza for Domino's, illustrates its adaptability and potential for diverse applications [18].
After a Year Out of Sight, Kimi's Yang Zhilin in a New Conversation: "Standing at the Beginning of Infinity"
创业邦· 2025-08-30 03:19
Core Viewpoint
- The article discusses advances in AI through the Kimi K2 model developed by Moonshot AI, covering the ongoing challenges and the philosophical implications of problem-solving in AI development [4][5][12].

Group 1: Kimi K2 Model Development
- Kimi K2, built on a Mixture-of-Experts (MoE) architecture, is an open-source model that can write code and interact with the digital world [4][5].
- Its release in July 2025 brought Moonshot AI back to public attention after a period of relative silence from its founder, Yang Zhilin [4][5].
- Development shifted from pre-training plus supervised fine-tuning to pre-training plus reinforcement learning, which significantly changed how the company operates [27][28].

Group 2: Philosophical Insights
- Yang Zhilin emphasizes that human civilization is a continuous process of conquering problems and expanding the boundaries of knowledge, drawing inspiration from David Deutsch's book "The Beginning of Infinity" [5][12].
- The idea that every solved problem leads to new questions is central to the ongoing development of AI, suggesting an endless journey of exploration and innovation [5][12].

Group 3: Technical Innovations
- K2 aims to maximize token efficiency, so that the model learns more from the same amount of data, which matters because the supply of high-quality data grows slowly [29][30].
- The Muon optimizer significantly improves token efficiency, letting the model learn from data more effectively than with traditional optimizers such as Adam [30][31].
- The model can carry out complex tasks over extended periods without human intervention, pointing to end-to-end automation in AI applications [17][44].

Group 4: Agentic Capabilities
- K2 is characterized as an agentic model, capable of multi-turn interaction and of using tools to connect with the external world, which strengthens its problem-solving capabilities [43][44].
- Multi-agent systems are highlighted as a way to improve task execution and collaboration among agents, enabling more complex problem-solving [22][44].
- Generalization remains an acknowledged challenge for agent models, with ongoing work to improve their adaptability across tasks and environments [34][46].
Sequoia Capital US: The Five AI Tracks We Will Focus On Over the Next Year
Founder Park· 2025-08-29 12:19
Core Viewpoint
- Sequoia Capital believes that the AI revolution will be a transformative change comparable to the Industrial Revolution, presenting a $10 trillion opportunity in the service industry, with only $20 billion currently automated by AI [2][11].

Investment Themes
- Sequoia will focus on five key investment themes over the next 12-18 months: persistent memory, communication protocols, AI voice, AI security, and open-source AI [2][30].

Historical Context
- The article draws parallels between the current AI revolution and milestones of the Industrial Revolution, emphasizing the importance of specialization in the development of complex systems [5][7][10].

Market Potential
- The U.S. service industry market is valued at $10 trillion, with only $20 billion currently impacted by AI, indicating a massive growth opportunity [11][13].

Investment Trends
- Five observed investment trends:
  1. Leverage over certainty: AI agents can significantly increase productivity despite some uncertainty [21].
  2. Real-world validation of AI capabilities, moving beyond academic benchmarks [23].
  3. The practical application of reinforcement learning in industry [25].
  4. AI's integration into the physical world, enhancing processes and hardware [27].
  5. Computing becoming a new productivity function, with knowledge workers' computational needs expected to increase dramatically [29].

Focus Areas for Investment
- Persistent memory is crucial for AI to integrate deeply into business processes, and challenges in this area remain open [31].
- Seamless communication protocols are needed for AI agents to collaborate effectively, similar to the TCP/IP standard in the internet revolution [34].
- AI voice technology is maturing, with applications in consumer and enterprise sectors [36][37].
- AI security presents a significant opportunity across the spectrum from development to consumer use [39].
- Open-source AI is at a critical juncture, with the potential to compete with proprietary models and foster a more open future [41].
Chinese Robots Live Up to Their Name: This One Plays Ping-Pong Impressively Well
量子位· 2025-08-29 11:37
Core Viewpoint
- The article discusses advances in humanoid robots, focusing on a table tennis robot developed by Tsinghua University students that achieves high-level play by combining hierarchical planning with reinforcement learning [7][8].

Group 1: Robot Performance
- The robot has a reaction time of 0.42 seconds and has achieved a maximum of 106 consecutive hits during a match [3][5][23].
- In real-world tests over 26 balls, the robot achieved a hitting rate of 96.2% (consistent with 25 balls struck) and a return rate of 92.3% (24 balls successfully returned) [21].

Group 2: Technical Framework
- The research team proposed a hierarchical framework that separates high-level planning from low-level control, allowing the robot to predict ball trajectories and execute human-like movements [9][11].
- A model-based planner predicts the ball's position, speed, and timing, while a reinforcement-learning-based controller generates coordinated movements [10][16].

Group 3: Training Methodology
- The robot was trained with a standard table tennis setup, with its hand modified to function as a paddle [13].
- Training incorporated human motion references to encourage human-like swings [18][19].

Group 4: Challenges in Robotics
- Table tennis is a hard sport for robots because perception, prediction, planning, and execution must all happen within a very short time frame [29][30].
- The sport demands agile full-body movement, including quick arm swings, waist rotation, and balance recovery, making it a complex task for humanoid robots [32][33].
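To make the planner's job concrete, the sketch below predicts when and where a ball crosses a chosen hitting plane. It is a deliberately simplified stand-in, not the authors' planner: it assumes pure ballistic flight with no air drag, spin, or table bounce, and the names `BallState`, `predict_at_plane`, and `x_hit` are hypothetical.

```python
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass
class BallState:
    """Ball position (m) and velocity (m/s) in a table-aligned frame."""
    x: float
    y: float
    z: float
    vx: float
    vy: float
    vz: float


def predict_at_plane(ball, x_hit):
    """Predict time, position, and velocity when the ball reaches x = x_hit.

    Returns None if the ball is not travelling toward the plane.
    Assumes drag-free, spin-free flight with no bounce (a simplification).
    """
    if ball.vx == 0.0:
        return None
    t = (x_hit - ball.x) / ball.vx
    if t < 0.0:
        return None
    y = ball.y + ball.vy * t
    z = ball.z + ball.vz * t - 0.5 * G * t * t
    vz = ball.vz - G * t
    return t, (x_hit, y, z), (ball.vx, ball.vy, vz)


if __name__ == "__main__":
    ball = BallState(x=0.0, y=0.0, z=0.3, vx=4.0, vy=0.2, vz=2.0)
    print(predict_at_plane(ball, x_hit=2.0))
```

In a hierarchical setup like the one described, a prediction of this kind would be the target handed to the low-level controller, which then decides how the body moves to meet the ball in time.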
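Saining Xie Recalls His OpenAI Interview Seven Years Ago: Whiteboard Coding, a Five-Hour Meeting, and It Was Dark by the Time It Ended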
机器之心· 2025-08-29 09:53
Core Insights
- The article recounts AI researchers' interview experiences at major tech companies, highlighting differences in interview style and in what each company focuses on [1][9][20].

Group 1: Interview Experiences
- Lucas Beyer, a researcher with extensive experience at top AI firms, started a poll about memorable interview experiences at companies such as Google, Meta, and OpenAI [2][20].
- Saining Xie said his interviews at various AI companies were unforgettable, particularly the rigorous two-hour marathon interview at DeepMind, which involved solving over 100 math and machine learning problems [5][6].
- The interview process at Meta was described as more academic, centered on discussions with prominent researchers rather than only on coding [6][7].

Group 2: Company-Specific Insights
- The interview style at Google Research was likened to an academic job interview, with significant emphasis on research discussion rather than solely on coding challenges [7].
- OpenAI's interview involved a lengthy session on a reinforcement learning problem, reflecting the company's commitment to deep research engagement [8][9].
- The interview questions reflect these companies' research priorities, such as Meta's focus on computer vision and OpenAI's emphasis on reinforcement learning [9][20].

Group 3: Notable Interviewers and Candidates
- Notable figures such as John Schulman and Noam Shazeer were mentioned as interviewers, indicating the caliber of talent involved in hiring at these firms [7][9].
- Candidates shared memorable moments from their interviews, such as solving complex problems on napkins or holding deep discussions about research topics [19][20].
A Quadruped Robot Dog Plus a Single Arm: Start Your Embodied-AI Learning Journey at Low Cost
具身智能之心· 2025-08-29 04:00
Core Viewpoint
- Xdog is a low-cost, multifunctional development platform that combines a quadruped robot dog with a robotic arm, aimed at embodied-AI developers and accompanied by a comprehensive curriculum for robotics research and learning [1][2].

Group 1: Hardware Overview
- Xdog integrates a robot dog and a robotic arm, with functions including voice control, sim2real, real2sim, target recognition and tracking, autonomous grasping, and reinforcement learning gait control [2][5].
- The robot dog measures 25 cm x 20 cm x 30 cm and weighs 7.0 kg, with a maximum speed of 7.2 km/h and a maximum rotation speed of 450 degrees per second [3][11].
- The main control chip is an Allwinner H616, with a quad-core 1.6 GHz CPU, 4 GB of RAM, and 32 GB of storage [4][5].

Group 2: Technical Specifications
- The robot dog's battery capacity is 93.24 Wh, giving approximately 120 minutes of operating time and about 6 hours of standby time [5][11].
- The robotic arm can reach a maximum height of 0.85 m and has a grasping range of 0.4 m around its base [7].
- The depth camera uses active dual infrared and structured light, with a depth output resolution of 1280 × 800 at 30 fps and a working distance of 0.2 m to 10 m [14].

Group 3: Software and Functionality
- The system supports several control methods, including voice control, keyboard control, visual control, and reinforcement learning for autonomous movement [15][17].
- Development is based on ROS1 with Python as the primary programming language; a GPU of at least RTX 2080 Ti class is recommended for inference [16][24].
- The platform supports coordinated control of the arm and the dog for target following, as well as autonomous grasping [19][20].

Group 4: Educational Curriculum
- The curriculum includes hands-on training in ROS project creation, MuJoCo simulation, and reinforcement learning principles, among other topics [22][23].
- Courses cover setting up and using the Xdog system, including network configuration, camera parameter adjustment, and advanced algorithms for object recognition and tracking [22][23].
- The teaching team consists of experienced instructors responsible for project management, technical support, and algorithm training [22].

Group 5: Delivery and Support
- Delivery is completed within three weeks of payment, with a one-year after-sales warranty [25][26].
- The product includes the hardware and accompanying courses; returns and exchanges are not accepted except for quality issues [26].
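A quick back-of-the-envelope check of the quoted specifications can help when planning experiments. The sketch below only rearranges numbers from the summary above; the function names are hypothetical, and the power figures are averages implied by capacity and runtime, not measured values.

```python
def kmh_to_ms(speed_kmh):
    """Convert kilometers per hour to meters per second."""
    return speed_kmh / 3.6


def average_power_w(capacity_wh, runtime_min):
    """Average power draw implied by draining the battery over the runtime."""
    return capacity_wh / (runtime_min / 60.0)


if __name__ == "__main__":
    # Quoted specs: 7.2 km/h top speed, 93.24 Wh battery,
    # about 120 min operating time, about 6 h standby.
    print(f"top speed: {kmh_to_ms(7.2):.2f} m/s")
    print(f"operating draw: {average_power_w(93.24, 120):.2f} W")
    print(f"standby draw: {average_power_w(93.24, 6 * 60):.2f} W")
```

Under these assumptions the top speed works out to about 2 m/s, with an implied average draw of roughly 47 W while operating and roughly 16 W on standby.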