Reinforcement Learning
What Exactly Makes the "Strongest Embodied VLA Model" So Strong?
36Kr· 2025-11-20 07:38
Core Insights
- The core contribution of the π*0.6 model lies in its introduction of a more intuitive learning method called RECAP, which allows robots to learn from their mistakes rather than merely imitating correct actions [3][8][24]
- The model demonstrates success rates above 90% on tasks such as making espresso, folding clothes, and assembling packaging boxes, showcasing its practical capabilities [1][20]

Group 1: RECAP Methodology
- RECAP consists of three main phases: offline reinforcement learning (RL) on diverse demonstration data, fine-tuning with human guidance, and online execution in which robots learn from sparse rewards and expert corrections [10][20]
- The methodology leverages a value function to evaluate actions and an advantage-conditioned policy update, allowing efficient learning from both successful and unsuccessful experience [13][16][42]

Group 2: Model Architecture and Performance
- The π*0.6 model builds on previous versions, expanding its backbone from Gemma (2.6 billion parameters) to Gemma 3 (4 billion parameters) and increasing the Action Expert to 860 million parameters [20]
- On challenging tasks, RECAP doubled throughput (successful task completions per hour) and reduced failure rates by approximately 50% compared with models that used only supervised fine-tuning [20]

Group 3: Learning from Mistakes
- The RECAP approach emphasizes learning from errors, enabling robots to recover from mistakes through expert intervention and self-correction, which is crucial for real-world applications [24][28]
- By using a value function to assess the quality of actions, the model can identify key steps and sources of error, enhancing its ability to adapt and improve in complex environments [39][41]
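The value-function idea in Groups 1 and 3 can be made concrete with a small sketch. This is a generic TD-residual labeling scheme with assumed interfaces, not the actual π*0.6 implementation: transitions whose outcome beats the value estimate get an advantage label of 1, and an advantage-conditioned policy is then trained to imitate primarily those.

```python
import numpy as np

def advantage_indicator(value_fn, states, next_states, rewards, gamma=0.99):
    """Label each transition 1 if it did better than the value estimate
    predicted (positive TD residual r + gamma*V(s') - V(s)), else 0."""
    advantages = rewards + gamma * value_fn(next_states) - value_fn(states)
    return (advantages > 0).astype(np.int64), advantages

# Toy value function for a 1-D reach task: closer to the goal at 10 is better.
value_fn = lambda s: -np.abs(s - 10.0)

states = np.array([0.0, 4.0, 8.0])
next_states = np.array([1.0, 3.0, 9.0])  # the middle transition moves away from the goal
rewards = np.zeros(3)

labels, advantages = advantage_indicator(value_fn, states, next_states, rewards)
print(labels)  # [1 0 1]: only the progress-making transitions are labeled 1
```

Conditioning the policy on this label lets training consume failures as well as successes, which matches the article's point about learning from unsuccessful experience.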
Z Tech | LMSYS Team Releases Miles, a Large-Scale MoE Reinforcement Learning Framework: Without Accumulating Small Steps, One Cannot Reach a Thousand Miles
Z Potentials· 2025-11-20 04:12
Core Insights
- The article introduces Miles, a new reinforcement learning framework designed for enterprise-level large-scale MoE training and production workloads, developed by the LMSYS team as a fork of the lightweight framework slime [1][4]

Group 1: Framework Features
- Miles inherits the lightweight and modular design principles of slime, making it a preferred tool for model scientists exploring algorithms [3]
- It implements infrastructure-level true on-policy training to eliminate discrepancies between training and inference, achieving bit-wise consistency [5]
- The framework introduces speculative training through MTP online training, yielding over 25% rollout acceleration [3][9]

Group 2: Memory Optimization
- Miles incorporates advanced memory-management techniques to maximize GPU performance without triggering out-of-memory (OOM) errors [8]
- It features online SFT for draft models, which preserves performance by preventing a decline in acceptance length during training [9]
- The framework includes mechanisms to avoid benign OOM errors and implements memory-margin strategies to address NCCL-related OOM issues [10]

Group 3: Technical Upgrades
- Miles supports full-stack optimization for SGLang and Megatron, ensuring compatibility with rapid iteration in training and inference frameworks [6]
- The modular design allows researchers to easily modify components like algorithms, data, sampling, and evaluation with minimal code changes [6]
- It provides a user-friendly interface for model scientists, allowing them to adjust importance sampling or loss dynamics without delving into lower-level code [6]

Group 4: Future Development
- The LMSYS team plans to enhance the FSDP backend for improved stability in large-scale distributed training [14]
- Future developments include independent rollout deployment, additional debugging tools, and formal mathematical verification for SFT/RL scripts [14]
- The roadmap also aims to support next-generation hardware like GB300 and expand capabilities for multi-modal training [18]
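The "bit-wise consistency" property claimed above has a simple operational meaning that can be sketched as follows. The function below is an illustrative check over assumed array inputs, not part of the Miles API: logits from the inference engine and from the training forward pass must match exactly at the bit level, because even tiny numerical drift makes the collected rollouts subtly off-policy.

```python
import numpy as np

def bitwise_consistent(train_logits: np.ndarray, rollout_logits: np.ndarray) -> bool:
    """Compare float64 logits by their raw bit patterns: true on-policy
    training demands exact equality, not approximate closeness."""
    return bool((train_logits.view(np.uint64) == rollout_logits.view(np.uint64)).all())

logits = np.array([0.125, -3.5, 2.0])
print(bitwise_consistent(logits, logits.copy()))   # True
print(bitwise_consistent(logits, logits + 1e-12))  # False: drift that np.allclose would miss
```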
Talking AI? Then the QbitAI MEET Conference Is the Place to Be!
量子位· 2025-11-20 04:09
Core Insights
- The article emphasizes the transformative impact of artificial intelligence (AI) on various industries, marking the beginning of a new era in 2025 [1]
- The MEET2026 Intelligent Future Conference will focus on cutting-edge technologies and industry advancements related to AI [2][3]
- The conference theme "Symbiosis Without Boundaries, Intelligence to Ignite the Future" highlights AI's role in driving societal evolution [3]

Event Details
- The conference will cover hot topics in the tech circle, including reinforcement learning, multimodal AI, chip computing power, AI in various industries, and AI going global [4]
- It will feature a blend of academic frontiers and commercial applications, showcasing leading technological achievements from infrastructure, models, and products [5]
- The event will also include the authoritative release of the annual AI rankings and trends report [6]

Notable Speakers
- The conference will host prominent figures such as Zhang Yaqin, a renowned scientist and entrepreneur in AI and digital video [12][13]
- Sun Maosong, Executive Vice President of the Tsinghua University AI Research Institute, will also be a key speaker [17]
- Other notable speakers include Wang Zhongyuan, Zhao Junbo, and Liu Fanping, all of whom have made significant contributions to AI research and applications [21][27][48]

AI Rankings and Trends Report
- The "Artificial Intelligence Annual Rankings" initiated by Quantum Bit has become one of the most influential lists in the AI industry, evaluating companies, products, and individuals [60]
- The "2025 Annual AI Trends Report" will identify and analyze ten major AI trends, focusing on technological maturity, current applications, and potential value [61]

Conference Logistics
- The MEET2026 Intelligent Future Conference will take place at the Beijing Jinmao Renaissance Hotel, with registration now open for attendees [62]
- The event aims to attract thousands of tech professionals and millions of online viewers, establishing itself as a significant annual technology business summit [64]
From Complete Beginner to Embodied-AI Algorithm Engineer: The Level-Grinding Road
具身智能之心· 2025-11-20 04:02
Today a longtime student of ours landed an offer from a top company and remarked with a laugh that the grind from complete beginner to algorithm engineer was genuinely hard, but there really is a path. From tinkering on his own with a so-100 to later following a systematic curriculum, he not only saved a great deal of time but also steered clear of many pitfalls.

Here we recommend several research routes in the embodied direction, covering VLA, VLN, diffusion policy, reinforcement learning, and more. You are welcome to scan the QR code to start learning.

VLA direction

A robot system built on VLA mainly comprises a visual perception module, language-instruction understanding, and a policy network that generates executable robot actions. Depending on requirements, current VLA work falls into three paradigms: explicit end-to-end VLA, implicit end-to-end VLA, and hierarchical end-to-end VLA.

Explicit end-to-end VLA is the most common and classic paradigm. It typically compresses visual and language information into a joint representation and then re-maps that representation into the action space to generate the corresponding actions. This end-to-end paradigm builds on broad prior research; across different architectures (diffusion / transformer / DiT), model sizes, application settings (2D/3D), and task requirements (training from scratch vs. downstream fine-tuning), it has produced a wide range of schemes with solid performance.

Implicit end-to-end VLA, by contrast, places more emphasis on interpretability, aiming to leverage current video d ...
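The explicit end-to-end paradigm described above (compress vision and language into a joint representation, then re-map it into the action space) can be sketched in a few lines. The dimensions and the simple two-layer structure are illustrative assumptions, not any specific published model:

```python
import numpy as np

rng = np.random.default_rng(0)

def explicit_vla_step(image_feat, text_feat, params):
    # 1) Compress visual and language features into one joint representation.
    joint = np.tanh(np.concatenate([image_feat, text_feat]) @ params["fuse"])
    # 2) Re-map the joint representation into the action space
    #    (here a 7-DoF end-effector command).
    return joint @ params["action_head"]

d_img, d_txt, d_joint, d_act = 16, 8, 32, 7
params = {
    "fuse": rng.normal(size=(d_img + d_txt, d_joint)) * 0.1,
    "action_head": rng.normal(size=(d_joint, d_act)) * 0.1,
}

action = explicit_vla_step(rng.normal(size=d_img), rng.normal(size=d_txt), params)
print(action.shape)  # (7,)
```

Real systems replace these linear maps with diffusion, transformer, or DiT action heads, as the text notes; the data flow stays the same.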
AI's Next Step: Is Reinforcement Learning the Right Answer for AGI? | Silicon Valley 101 Annual Offline Conference | Alignment 2025
硅谷101· 2025-11-20 03:56
[Silicon Valley 101 annual offline conference replay] In 2016, AlphaGo's defeat of the world Go champion made reinforcement learning famous overnight. Today, from recommendation algorithms to autonomous driving, reinforcement learning has become the second engine driving AI's evolution toward AGI. Yet its low efficiency and inherent flaws have drawn criticism from experts, including OpenAI co-founder Andrej Karpathy.

At the reinforcement learning panel of this year's Silicon Valley 101 Alignment conference, we invited four heavyweight guests from OpenAI, Amazon, formerly Meta, and LinkedIn for an exceptionally candid and hardcore discussion of frontier topics: RLVR (reinforcement learning with verifiable rewards), the "gold standard" of human feedback data, exploration and abstraction, and the "OaK" architecture proposed by the "father of reinforcement learning". Where do they see the limits of reinforcement learning? And can AI ultimately achieve genuine knowledge innovation through it?

Silicon Valley 101 held the Alignment 2025 annual tech conference offline in Silicon Valley on October 5, 2025, where many speakers shared highly valuable perspectives that we will gradually publish. The offline conference was held entirely in English; the guests' talks will be presented with Chinese subtitles.

Panelists:
Zheqing Zhu (moderator): founder of Pokee.ai, former head of applied reinforcement learning at Meta AI
Lihong Li: Amazon ...
Ant Group Open-Sources Awex, a High-Performance Weight-Exchange Framework for Trillion-Parameter Reinforcement Learning
Mei Ri Jing Ji Xin Wen· 2025-11-20 01:51
(Article source: National Business Daily) NBD AI Newsflash, November 20: Ant Group announced the open-sourcing of Awex, a high-performance weight-exchange framework for trillion-parameter reinforcement learning. ...
Talking AI? Then the QbitAI MEET Conference Is the Place to Be!
量子位· 2025-11-19 06:20
Core Insights
- The article emphasizes the transformative impact of artificial intelligence (AI) on various industries, marking the beginning of a new era in 2025 [1]
- The MEET2026 Intelligent Future Conference will focus on cutting-edge technologies and industry advancements related to AI [2][3]
- The conference theme "Symbiosis Without Boundaries, Intelligence to Ignite the Future" highlights AI's role in driving societal evolution [3]

Event Details
- The conference will cover hot topics in the tech circle, including reinforcement learning, multimodal AI, chip computing power, AI in various industries, and AI going global [4]
- It will feature a collision of academic frontiers and commercial applications, showcasing leading technological achievements from infrastructure, models, and products [5]
- The event will also include the authoritative release of the annual AI rankings and trends report [6]

Notable Speakers
- The conference will host prominent figures such as Zhang Yaqin, a world-class scientist and entrepreneur in AI and digital video [12][13]
- Sun Maosong, Executive Vice President of Tsinghua University's AI Research Institute, will also be a key speaker [17]
- Other notable speakers include Wang Zhongyuan, Zhao Junbo, and Liu Fanping, all recognized for their contributions to AI and technology [21][27][48]

AI Rankings and Trends Report
- The "Artificial Intelligence Annual Rankings" initiated by Quantum Bit has become one of the most influential rankings in the AI industry, evaluating companies, products, and individuals [60]
- The "2025 Annual AI Trends Report" will focus on the main themes of technological development, identifying ten significant AI trends and analyzing their potential value [61]

Conference Logistics
- The MEET2026 Intelligent Future Conference will take place at the Beijing Jinmao Renaissance Hotel, with registration now open for attendees [62]
- The event aims to attract thousands of tech professionals and millions of online viewers, establishing itself as an annual barometer for the intelligent technology industry [64]
Physical Intelligence Team Officially Releases π*0.6! VLA + Reinforcement Learning Training
具身智能之心· 2025-11-19 00:34
Author: Physical Intelligence team. Editor: 具身智能之心. This article is shared for academic purposes only.

On November 17, the Physical Intelligence team officially released π*0.6, a VLA that learns from experience.

Project link: https://www.pi.website/blog/pistar06
Paper link: https://www.pi.website/download/pistar06.pdf

How can VLA models self-improve through reinforcement learning in real-world deployment? The team proposes a general method, RECAP: advantage-conditioned policy reinforcement learning based on experience and corrections, which enables reinforcement learning training of VLA models through an advantage-conditioning mechanism.

The method integrates heterogeneous data into the self-improvement process, including demonstration data, data collected online, and expert teleoperation interventions during autonomous execution. RECAP first pre-trains a general-purpose VLA model via offline reinforcement learning; the model can then be specialized for downstream tasks through on-robot data collection. Experiments show ...
End-to-End and VLA Jobs: The Salaries Are Absurdly High...
自动驾驶之心· 2025-11-19 00:03
Core Insights
- There is significant demand for end-to-end and VLA (Vision-Language-Action) technical talent in the automotive industry, with salaries for experts reaching up to $70,000 per month for positions requiring 3-5 years of experience [1]
- The technology stack involved in end-to-end and VLA is complex, covering various advanced algorithms and models such as BEV perception, VLM (Vision-Language Model), diffusion models, reinforcement learning, and world models [2]

Course Offerings
- The company is launching two specialized courses, "End-to-End and VLA Autonomous Driving Class" and "Practical Course on VLA and Large Models," aimed at helping individuals quickly and efficiently enter the field of end-to-end and VLA technologies [2]
- The "Practical Course on VLA and Large Models" focuses on VLA, covering topics from VLM as an autonomous driving interpreter to modular and integrated VLA, including mainstream inference-enhanced VLA [2]
- The course includes a detailed theoretical foundation and practical assignments, teaching participants how to build their own VLA models and datasets from scratch [2]

Instructor Team
- The instructor team consists of experts from both academia and industry, including individuals with extensive research and practical experience in multi-modal perception, autonomous driving VLA, and large-model frameworks [7][10][13]
- Notable instructors include a Tsinghua University master's graduate with multiple publications at top conferences and a current algorithm expert at a leading domestic OEM [7][13]

Target Audience
- The courses are designed for individuals with foundational knowledge of autonomous driving who are familiar with its basic modules and with concepts related to transformer-based large models, reinforcement learning, and BEV perception [15]
- Participants are expected to have a background in probability theory and linear algebra, as well as proficiency in Python and PyTorch [15]
Physical Intelligence Team Officially Releases π*0.6
自动驾驶之心· 2025-11-19 00:03
Core Insights
- The article discusses the release of the VLA model by the Physical Intelligence team, which utilizes a novel reinforcement learning method called RECAP to enable self-improvement in real-world deployments [2][4][10]

Summary by Sections

Introduction to VLA and RECAP
- The VLA model is designed to learn from experience and improve its performance through RECAP, which integrates heterogeneous data sources including demonstration data, online collected data, and expert intervention during autonomous execution [4][7]

Methodology
- RECAP combines offline reinforcement learning for pre-training the VLA model with further training on data collected during deployment. The method aims to enhance the model's robustness and operational efficiency by integrating feedback from various sources [7][10][11]

Training Process
- The training process involves three main steps: data collection, value-function training, and advantage-conditioned training. These steps are repeated to optimize the VLA model [11][12][13]
- Data collection involves running the VLA model on tasks and labeling results with reward values, with the option of human intervention to correct early errors [12]
- The value function is trained on all collected data to detect faults and estimate the time required for task completion [13][19]
- Advantage-conditioned training improves the VLA policy by incorporating optimality metrics derived from the value function [13][19]

Applications and Performance
- RECAP has been successfully applied to complex tasks such as folding clothes, assembling boxes, and making espresso. The model achieved more than double the throughput and reduced failure rates by approximately 50% on challenging tasks [10][28][30]
- The model's robustness was validated through real-world deployments, where it operated for extended periods without interruption [10][30]

Experimental Analysis
- The article details the tasks evaluated during experiments, including clothing folding, coffee making, and box assembly, with specific success criteria for each [23][24][25]
- Results showed that RECAP significantly enhanced both throughput and success rates across all tasks, with the most notable improvements on diverse clothing folding and coffee making [28][30][32]

Future Directions
- The article identifies areas for improvement in the RECAP system, including automating reward feedback and intervention processes and exploring more sophisticated exploration mechanisms [36]
- It also suggests that transitioning to a fully online reinforcement learning framework could enhance the efficiency of VLA training [36]
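The three-step loop from the training-process summary (collect data, train the value function, run advantage-conditioned training, then repeat) can be sketched with toy stand-ins. Everything below is a schematic assumption on a one-step task, not the actual π*0.6 training code:

```python
import random

random.seed(0)

def collect_episode(policy):
    """Step 1: run the policy and label the outcome with a reward."""
    action = policy()
    return {"action": action, "reward": 1.0 if action > 0.5 else 0.0}

def fit_value(dataset):
    """Step 2: the 'value function' here is just the mean observed reward."""
    return sum(d["reward"] for d in dataset) / len(dataset)

def advantage_conditioned_update(dataset, baseline):
    """Step 3: imitate only the episodes that beat the value baseline."""
    good = [d["action"] for d in dataset if d["reward"] > baseline] or [0.5]
    center = sum(good) / len(good)
    return lambda: min(1.0, center + random.uniform(-0.1, 0.1))

policy = lambda: random.random()  # untrained policy: succeeds roughly half the time
dataset = []
for _ in range(3):  # repeat the three steps, as RECAP iterates them
    dataset += [collect_episode(policy) for _ in range(50)]
    baseline = fit_value(dataset)
    policy = advantage_conditioned_update(dataset, baseline)

success_rate = sum(collect_episode(policy)["reward"] for _ in range(100)) / 100
print(success_rate)  # close to 1.0 once the policy imitates only good episodes
```

The outer loop is where deployment data (including expert corrections, in the real system) keeps feeding back into the value function and policy.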