Reinforcement Learning
Z Tech | LMSYS team releases Miles, a large-scale MoE reinforcement learning framework: a thousand miles is reached one small step at a time
Z Potentials· 2025-11-20 04:12
Core Insights
- The article introduces Miles, a new reinforcement learning framework designed for enterprise-scale MoE training and production workloads, developed by the LMSYS team as a fork of the lightweight framework slime [1][4]

Group 1: Framework Features
- Miles inherits slime's lightweight, modular design principles, making it a preferred tool for model scientists exploring algorithms [3]
- It implements infrastructure-level true on-policy training to eliminate discrepancies between training and inference, achieving bit-wise consistency [5]
- It introduces speculative training through MTP online training, yielding over 25% rollout acceleration [3][9]

Group 2: Memory Optimization
- Miles incorporates advanced memory management techniques to maximize GPU utilization without triggering out-of-memory (OOM) errors [8]
- Online SFT for draft models prevents acceptance length from declining during training [9]
- The framework includes mechanisms to avoid benign OOM errors and memory-margin strategies to address NCCL-related OOM issues [10]

Group 3: Technical Upgrades
- Miles supports full-stack optimization for SGLang and Megatron, ensuring compatibility with the rapid iteration of training and inference frameworks [6]
- The modular design allows researchers to modify components such as algorithms, data, sampling, and evaluation with minimal code changes [6]
- A user-friendly interface lets model scientists adjust importance sampling or loss dynamics without delving into lower-level code [6]

Group 4: Future Development
- The LMSYS team plans to enhance the FSDP backend for improved stability in large-scale distributed training [14]
- Future developments include independent rollout deployment, additional debugging tools, and formal mathematical verification for SFT/RL scripts [14]
- The roadmap also aims to support next-generation hardware such as GB300 and to expand multi-modal training capabilities [18]
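The "bit-wise consistency" claim above is concrete enough to sketch: infrastructure-level true on-policy means the log-probs the trainer recomputes must match the ones the rollout engine actually sampled from exactly, not merely within a tolerance. Below is a minimal illustrative checker; the function name and structure are hypothetical and are not Miles's API.

```python
import math

def logprobs_bitwise_equal(train_logprobs, infer_logprobs):
    """Return True only if every token log-prob matches bit-for-bit.

    No `allclose`, no epsilon: for finite floats, `a == b` plus a sign
    check (to distinguish 0.0 from -0.0) is equivalent to comparing
    the underlying bit patterns. NaNs are treated as a mismatch.
    """
    if len(train_logprobs) != len(infer_logprobs):
        return False
    for a, b in zip(train_logprobs, infer_logprobs):
        if math.isnan(a) or math.isnan(b):
            return False
        if a != b or math.copysign(1.0, a) != math.copysign(1.0, b):
            return False
    return True
```

In a real pipeline such a check would run over the trainer's recomputed per-token log-probs and the values logged by the inference engine; any mismatch signals a train/infer numerical divergence.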
To talk AI, of course you come to the Quantum Bit (量子位) MEET conference!
量子位· 2025-11-20 04:09
Core Insights
- The article emphasizes the transformative impact of artificial intelligence (AI) on various industries, marking the beginning of a new era in 2025 [1]
- The MEET2026 Intelligent Future Conference will focus on cutting-edge technologies and industry advancements related to AI [2][3]
- The conference theme, "Symbiosis Without Boundaries, Intelligence to Ignite the Future," highlights AI's role in driving societal evolution [3]

Event Details
- The conference will cover hot topics in the tech circle, including reinforcement learning, multimodal AI, chip computing power, AI across industries, and AI going global [4]
- It will feature a blend of academic frontiers and commercial applications, showcasing leading technological achievements across infrastructure, models, and products [5]
- The event will also include the authoritative release of the annual AI rankings and trends report [6]

Notable Speakers
- The conference will host prominent figures such as Zhang Yaqin, a renowned scientist and entrepreneur in AI and digital video [12][13]
- Sun Maosong, Executive Vice President of the Tsinghua University AI Research Institute, will also be a key speaker [17]
- Other notable speakers include Wang Zhongyuan, Zhao Junbo, and Liu Fanping, all of whom have made significant contributions to AI research and applications [21][27][48]

AI Rankings and Trends Report
- The "Artificial Intelligence Annual Rankings," initiated by Quantum Bit, has become one of the most influential lists in the AI industry, evaluating companies, products, and individuals [60]
- The "2025 Annual AI Trends Report" will identify and analyze ten major AI trends, focusing on technological maturity, current applications, and potential value [61]

Conference Logistics
- The MEET2026 Intelligent Future Conference will take place at the Beijing Jinmao Renaissance Hotel, with registration now open for attendees [62]
- The event aims to attract thousands of tech professionals and millions of online viewers, establishing itself as a significant annual technology business summit [64]
From total beginner to embodied-AI algorithm engineer: the level-grinding road
具身智能之心· 2025-11-20 04:02
Core Insights
- The article discusses the evolution and research directions of Vision-Language-Action (VLA) models, Vision-and-Language Navigation (VLN), and reinforcement learning in robotics, highlighting the importance of these technologies in enhancing robot capabilities and performance [1][2][5][9]

VLA Direction
- VLA systems consist of visual perception, language-instruction understanding, and an action-policy network, and fall into three paradigms: explicit end-to-end VLA, implicit end-to-end VLA, and hierarchical end-to-end VLA [1][2]
- Explicit end-to-end VLA compresses visual and language information into a joint representation that is mapped to the action space, leveraging various architectures and models to achieve strong performance [1]
- Implicit end-to-end VLA focuses on interpretability by predicting future states with video diffusion models, enhancing the potential for scaling VLA models [2]
- Hierarchical end-to-end VLA aims to exploit the generalization of large models while keeping downstream execution efficient [2]

VLN Direction
- VLN systems are composed of visual-language encoders, environment-history representations, and action policies, requiring effective compression of information from visual and language inputs [5][6]
- The choice of encoder, and whether to project visual and language representations into a common space, are critical issues; current trends favor models pre-trained on large datasets and the use of large language models (LLMs) for instruction decomposition [6]
- VLN robots operate in a sequential decision-making task, accumulating historical information to inform future actions, with implicit methods representing past information as latent variables [6]
- Object Navigation within VLN emphasizes identifying target objects from category information, reducing the need for detailed instructions and enhancing exploration capabilities [7]

Reinforcement Learning & Legged Robots
- Reinforcement learning is crucial for legged robots, covering kinematics, dynamics, multi-modal sensor fusion, and advanced algorithms for task adaptation [9][10]
- Key areas include gait planning, balance control for bipedal robots, and the application of deep reinforcement learning and imitation learning to multi-task training [10]
- Techniques such as domain randomization and safety mechanisms are essential for successful real-world deployment of robotic systems [10]

Diffusion Policy
- The introduction of diffusion models into robotics has led to significant advances, with Diffusion Policy achieving an average performance improvement of 46.9% across simulation environments [21][22]
- The Robotic Diffusion Transformer (RDT), with 1.2 billion parameters, shows strong zero-shot generalization and can learn new skills from minimal examples [22]
- Diffusion-based policies are expanding beyond robotic manipulation to areas such as autonomous navigation and dexterous grasping, improving task success rates through real-time environmental adaptation [22][23]
- Recent developments include advances in 3D applications and the integration of safety and online reinforcement learning, opening new research avenues [23]
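The Diffusion Policy results above rest on a simple mechanism: an action is generated by starting from Gaussian noise and iteratively denoising it with a learned noise-prediction network conditioned on the observation. Below is a toy 1-D sketch of the reverse (DDPM-style) sampling loop, with the trained network stubbed out as a callable; the schedule constants and names are illustrative, not any paper's exact settings.

```python
import math
import random

def sample_action(eps_model, obs, n_steps=10, seed=0):
    """Reverse-diffusion sampling of a 1-D action, DDPM-style.

    `eps_model(obs, a, t)` predicts the noise present in action `a`
    at diffusion step `t`; in a real Diffusion Policy this is a
    trained network conditioned on visual observations.
    """
    rng = random.Random(seed)
    # linear beta schedule, then cumulative products of alphas
    betas = [1e-4 + (0.02 - 1e-4) * t / (n_steps - 1) for t in range(n_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    x = rng.gauss(0.0, 1.0)              # start from pure noise
    for t in reversed(range(n_steps)):
        eps = eps_model(obs, x, t)       # predicted noise at step t
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / math.sqrt(alphas[t])
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * noise
    return x
```

In practice the same loop runs over a whole action chunk (a short trajectory of multi-dimensional actions) rather than a scalar, which is what gives diffusion policies their multi-modal, temporally coherent behavior.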
AI's next step: is reinforcement learning the right path to AGI? | 硅谷101 annual offline conference | Alignment 2025
硅谷101· 2025-11-20 03:56
[Replay of the 硅谷101 annual offline conference] In 2016, AlphaGo's defeat of the world Go champion made reinforcement learning famous in a single stroke. Today, from recommendation algorithms to autonomous driving, reinforcement learning has become the second engine propelling AI's evolution toward AGI. Yet its low efficiency and inherent flaws have drawn skepticism from experts including OpenAI co-founder Andrej Karpathy.

At the reinforcement learning panel of this year's 硅谷101 Alignment conference, we invited four heavyweight guests from OpenAI, Amazon, formerly Meta, and LinkedIn for an exceptionally candid and hardcore discussion of frontier topics: RLVR (reinforcement learning with verifiable rewards), the "gold standard" of human-feedback data, exploration and abstraction, and the "OaK" architecture from the man known as the "father of reinforcement learning." Where do they see the limits of reinforcement learning? And can AI, through reinforcement learning, ultimately reach genuine knowledge innovation?

At the Alignment 2025 annual tech conference that 硅谷101 held in person in Silicon Valley on October 5, 2025, many speakers shared highly valuable views, which we will gradually edit and publish. The offline conference was conducted entirely in English; the guests' talks will be presented with Chinese subtitles.

Panelists:
Zhu Zheqing (moderator): founder of Pokee.ai, former head of applied reinforcement learning at Meta AI
Lihong Li: Amazon ...
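RLVR, one of the panel topics above, replaces a learned reward model with a programmatic verifier: the reward is 1 only when a checker (an answer key, a unit test, a compiler) accepts the model's output. A minimal sketch of that reward function follows; it is illustrative only, not any panelist's implementation.

```python
def verifiable_reward(completion, check):
    """Binary reward for RLVR (RL with verifiable rewards).

    `check` is any callable returning True/False, e.g. an exact-match
    grader for a math answer or a test harness for generated code.
    Because the verifier is a program rather than a learned model,
    the reward cannot be gamed by exploiting reward-model quirks.
    """
    return 1.0 if check(completion) else 0.0

# e.g. grading a math answer against a known result
reward = verifiable_reward("42", lambda s: s.strip() == "42")
```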
Ant Group open-sources Awex, a high-performance weight-exchange framework for trillion-parameter reinforcement learning
Mei Ri Jing Ji Xin Wen· 2025-11-20 01:51
AI flash bulletin: on November 20, Ant Group announced it has open-sourced Awex, a high-performance weight-exchange framework for trillion-parameter reinforcement learning. ... (Source: Mei Ri Jing Ji Xin Wen)
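The general pattern such a weight-exchange framework automates can be sketched in a few lines: after each RL update, the rollout (inference) replicas must receive the trainer's new weights in place, without a save-to-disk round trip. The sketch below is purely illustrative and does not reflect the actual Awex API; at trillion-parameter scale the copy would be a sharded, RDMA-backed transfer rather than a Python loop.

```python
def exchange_weights(trainer_state, rollout_engines):
    """Push updated trainer weights into every rollout engine in place.

    `trainer_state` maps parameter names to weight buffers;
    each rollout engine holds preallocated buffers under the same
    names. Writing in place (slice assignment) keeps the engines'
    memory layout intact, which is the point of weight *exchange*
    as opposed to checkpoint reload.
    """
    for engine in rollout_engines:
        for name, weights in trainer_state.items():
            engine[name][:] = weights   # in-place overwrite
```

Usage: after an optimizer step, call `exchange_weights(trainer_state, engines)` so the next rollout batch is sampled from the freshly updated policy.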
Physical Intelligence team officially releases π*0.6! VLA + reinforcement learning training
具身智能之心· 2025-11-19 00:34
Author: Physical Intelligence team | Editor: 具身智能之心
This article is shared for academic purposes only; in case of infringement, contact us for removal.

On November 17, the Physical Intelligence team officially released π*0.6, a VLA that learns from experience.
Project link: https://www.pi.website/blog/pistar06
Paper link: https://www.pi.website/download/pistar06.pdf

How can a VLA model self-improve through reinforcement learning in real-world deployment? The team proposes a general method, RECAP: advantage-conditioned policy reinforcement learning based on experience and corrections, which trains VLA models through an advantage-conditioning mechanism. The method folds heterogeneous data into the self-improvement process, including demonstration data, data collected online, and expert teleoperation interventions during autonomous execution. RECAP first pre-trains a general-purpose VLA model via offline reinforcement learning; that model can then be specialized for downstream tasks through on-robot data collection to improve performance. Experiments show ...
End-to-end and VLA jobs: the salaries are absurdly high......
自动驾驶之心· 2025-11-19 00:03
Core Insights
- There is significant demand for end-to-end and VLA (Vision-Language-Action) technical talent in the automotive industry, with salaries for experts with 3-5 years of experience reaching up to $70,000 per month [1]
- The technology stack for end-to-end and VLA is complex, spanning advanced algorithms and models such as BEV perception, VLMs (vision-language models), diffusion models, reinforcement learning, and world models [2]

Course Offerings
- The company is launching two specialized courses, "End-to-End and VLA Autonomous Driving Class" and "Practical Course on VLA and Large Models," to help individuals enter the field of end-to-end and VLA technologies quickly and efficiently [2]
- The "Practical Course on VLA and Large Models" focuses on VLA, covering topics from VLMs as autonomous-driving interpreters to modular and integrated VLA, including mainstream inference-enhanced VLA [2]
- The course pairs a detailed theoretical foundation with practical assignments, teaching participants to build their own VLA models and datasets from scratch [2]

Instructor Team
- The instructor team combines experts from academia and industry with extensive research and practical experience in multi-modal perception, autonomous-driving VLA, and large-model frameworks [7][10][13]
- Notable instructors include a Tsinghua University master's graduate with multiple publications in top conferences and a current algorithm expert at a leading domestic OEM [7][13]

Target Audience
- The courses are designed for individuals with foundational knowledge of autonomous driving who are familiar with its basic modules and with concepts from transformer-based large models, reinforcement learning, and BEV perception [15]
- Participants are expected to have a background in probability theory and linear algebra, as well as proficiency in Python and PyTorch [15]
Physical Intelligence team officially releases π*0.6
自动驾驶之心· 2025-11-19 00:03
Core Insights
- The article discusses the release of the VLA model by the Physical Intelligence team, which uses a novel reinforcement learning method called RECAP to enable self-improvement in real-world deployments [2][4][10]

Summary by Sections

Introduction to VLA and RECAP
- The VLA model is designed to learn from experience and improve its performance through RECAP, which integrates heterogeneous data sources including demonstration data, data collected online, and expert interventions during autonomous execution [4][7]

Methodology
- RECAP combines offline reinforcement learning for pre-training the VLA model with further training on data collected during deployment, aiming to improve robustness and operational efficiency by integrating feedback from multiple sources [7][10][11]

Training Process
- Training involves three main steps, repeated to optimize the VLA model: data collection, value-function training, and advantage-conditioned training [11][12][13]
- Data collection runs the VLA model on tasks and labels the results with reward values, with optional human intervention to correct early errors [12]
- The value function is trained on all collected data to detect faults and estimate the time required to complete a task [13][19]
- Advantage-conditioned training improves the VLA policy by incorporating optimality signals derived from the value function [13][19]

Applications and Performance
- RECAP has been applied successfully to complex tasks such as folding clothes, assembling boxes, and making espresso; the model achieved more than twice the throughput and roughly 50% lower failure rates on challenging tasks [10][28][30]
- The model's robustness was validated in real-world deployments, where it operated for extended periods without interruption [10][30]

Experimental Analysis
- The experiments evaluated tasks including clothes folding, coffee making, and box assembly, each with specific success criteria [23][24][25]
- Results showed that RECAP significantly improved both throughput and success rates across all tasks, with the largest gains on diverse clothes folding and coffee making [28][30][32]

Future Directions
- The article identifies areas for improvement in the RECAP system, including automating reward feedback and intervention processes and exploring more sophisticated exploration mechanisms [36]
- Transitioning to a fully online reinforcement learning framework could further improve the efficiency of VLA training [36]
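The advantage-conditioned training step described above can be sketched concretely: a value function scores states along a trajectory, each action receives a binary "advantage" token depending on whether it improved the value, and the policy is trained conditioned on that token (and asked for "good" actions at deployment). The one-step formulation below is a simplified illustration, not the paper's exact objective.

```python
def advantage_condition_labels(values, threshold=0.0):
    """Label each transition of a trajectory with a binary advantage token.

    `values` are value-function estimates V(s_t) along one trajectory
    (here precomputed; in RECAP the value function is itself trained
    on all collected data). An action is tagged 1 ("good") when it
    raised the value by more than `threshold`, else 0. The policy is
    then trained on (state, token, action) triples, so that sampling
    with token=1 at deployment steers it toward high-advantage actions.
    """
    labels = []
    for v_t, v_next in zip(values, values[1:]):
        adv = v_next - v_t          # one-step advantage estimate
        labels.append(1 if adv > threshold else 0)
    return labels
```

Conditioning rather than filtering keeps all data in the training set, including corrections and failed rollouts, which is what lets the heterogeneous data sources above contribute to a single policy.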
Physical Intelligence's newly released VLA model: why is it the inflection point for robots on the road to scaled deployment? | Jinqiu Select
锦秋集· 2025-11-18 11:13
Core Insights
- The article discusses the limitations of current robot foundation models, which rely primarily on demonstration data, and argues for a structured reinforcement learning (RL) framework called Recap to enhance robot performance and reliability [2][3][10]

Group 1: Limitations of Current Models
- Current models depend heavily on demonstration data, which incurs high human cost, caps policies at human-level performance, and offers no capacity for self-improvement [2][10]
- Merely increasing model size is insufficient; a restructured training paradigm is essential for robots to move from "can demonstrate" to "can deploy at scale" [3][10]

Group 2: Introduction of the Recap Framework
- Recap integrates three training phases, demonstrations, corrections, and autonomous robot rollouts, allowing continuous improvement in policy quality [2][10]
- The framework addresses the compounding-error problem in robot policies by systematically using correction data, value functions, and advantages [3][10][12]

Group 3: Performance of the π*0.6 Model
- The π*0.6 model, with 5 billion parameters, can handle heterogeneous prompts and reaches performance thresholds suitable for commercial deployment [3][20]
- The model shows significant improvements in task execution, achieving success rates above 90% on complex tasks such as making espresso, folding clothes, and assembling boxes [25][20]

Group 4: Learning Process and Challenges
- Learning proceeds in three stages: offline reinforcement learning pre-training, task-specific fine-tuning, and continuous improvement from real-world experience [19][20]
- The article outlines the challenges of high-throughput autonomous execution, particularly for tasks requiring complex physical operations and adaptability to varied conditions [24][20]

Group 5: Data Sources for Learning
- The article identifies three data sources for robot learning: expert demonstrations for defining new behaviors, corrective guidance for refining policies, and autonomous experience for enhancing behavior [27][28]
- It posits that autonomous experience may become the crucial data source as robots are deployed more widely in real-world applications, potentially enabling performance that surpasses human capabilities [27][28]