Reinforcement Learning
Westlake University Proposes RDPO, a Reinforcement-Learning Framework for Parallel Inference Acceleration of Diffusion Models
量子位· 2026-01-13 07:21
Feiyang, compiled from Aofeisi. QbitAI | WeChat official account QbitAI

The era in which diffusion models (such as Stable Diffusion) "squeeze out" high-resolution images one at a time is being swept aside by a wave of world models generating high-definition video in real time. But for images and video alike, the sequential denoising process at the heart of diffusion models is like a relay race that cannot be run in parallel, and it has become the ultimate bottleneck for speed. How do you bolt on an acceleration engine without damaging the model's "painting skill"?

The RDPO (Residual Dirichlet Policy Optimization) framework proposed by Westlake University's AGI Lab offers a clever answer: leave the model itself untouched and instead optimize its "sampling navigation system". Importantly, because the extra gradient computations are independent, they can be fully parallelized, preserving low-latency sampling.

The team introduces a two-stage optimization framework. First, EPD-Solver optimizes a small set of learnable parameters via a distillation-based method; the team then proposes RDPO, a parameter-efficient reinforcement-learning fine-tuning framework that recasts the solver as a stochastic Dirichlet policy. Unlike conventional approaches that fine-tune a massive backbone network, this RL method operates strictly within the low-dimensional solver space, improving performance on complex text-to-image (T2I) generation tasks while effectively mitigating reward hacking ...
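The core idea described above, treating the low-dimensional solver schedule as a stochastic Dirichlet policy whose candidate samples can be scored independently (hence in parallel), can be sketched in toy form. The update below substitutes a simple cross-entropy-method step for RDPO's actual policy-gradient update, and the reward, dimensions, and coefficients are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_reward(w):
    # Stand-in for an image-quality reward over a 4-step solver schedule;
    # it prefers schedules that put more weight on later steps.  A real
    # setup would score actual generations here.
    target = np.array([0.1, 0.2, 0.3, 0.4])
    return -np.sum((w - target) ** 2)

# Dirichlet policy over solver mixing weights: each sample is a point on
# the simplex (non-negative weights summing to 1).
alpha = np.ones(4) * 5.0          # concentration parameters (learnable)

for _ in range(100):
    # All 32 candidate schedules are independent and could be scored in
    # parallel, which is what keeps sampling latency low.
    samples = rng.dirichlet(alpha, size=32)
    rewards = np.array([toy_reward(w) for w in samples])
    elites = samples[np.argsort(rewards)[-8:]]      # top-8 schedules
    # Cross-entropy-style update: pull the policy mean toward the elites.
    alpha = 0.9 * alpha + 0.1 * 50.0 * elites.mean(axis=0)

mean_w = alpha / alpha.sum()      # the policy's mean schedule
```

After training, the policy mean shifts from uniform weights toward the high-reward region of the simplex, without ever touching the (here nonexistent) backbone model.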
Challenging GRPO: NVIDIA Proposes GDPO, Specialized for Multi-Reward Optimization
具身智能之心· 2026-01-13 00:54
Editor: 机器之心

GRPO is one of the foundational techniques behind DeepSeek-R1's success. Over the past year or two, GRPO and its variants have become widely adopted reinforcement-learning algorithms in industry thanks to their efficiency and simplicity.

But as language models grow more capable, user expectations are shifting: models should not only answer correctly, but also behave in ways that match diverse human preferences across different scenarios. Reinforcement-learning training pipelines have therefore begun to incorporate multiple reward signals, each corresponding to a different preference, to jointly steer the model toward desired behavior.

A new NVIDIA paper, however, argues that GRPO may not be the best choice for multi-reward optimization. Specifically, in multi-reward settings GRPO normalizes different reward combinations into the same advantage values, which weakens the training signal and lowers reward levels.

To address this, the authors propose a new policy-optimization method, Group reward-Decoupled normalization Policy Optimization (GDPO). By normalizing each reward signal separately, GDPO prevents different rewards from being blended and "flattened" together, more faithfully preserving their relative differences. This makes multi-reward optimization more accurate while markedly improving training stability. ...
Join 具身智能之心 on the Journey Ahead: Partner Recruitment Now Open
具身智能之心· 2026-01-12 11:00
Core Insights
- The company is seeking to empower partners through online and offline training, consulting, data collection, and technology upgrades [1]
- Practitioners worldwide in the embodied-intelligence field are invited to collaborate in areas such as technical services, training, course development, and research guidance [1]

Major Directions
- The focus areas for collaboration include, but are not limited to: VLA, VLN, Diffusion Policy, Reinforcement Learning, VLA+RL, remote operation, motion capture, sim2real, multimodal large models, simulation, motion control, end-to-end systems, and 3D perception [3]

Job Description
- The positions are primarily aimed at embodied-solution development, hardware development, and training collaboration, targeting B-end clients (businesses and educational institutions) and C-end users (students and job seekers) [4]

Contact Information
- Interested parties can add WeChat oooops-life for further inquiries [5]
A Batch of End-to-End & VLA Job Openings Coming Soon
自动驾驶之心· 2026-01-12 03:15
Core Insights
- The consensus among industry experts indicates that 2026 will be a pivotal year for the development of end-to-end (E2E) and VLA (Vision-Language-Action) technologies in autonomous driving, with a focus on optimizing production processes rather than making significant algorithmic changes [1]
- The industry is actively recruiting experienced algorithm engineers and developing talent to tackle the complex challenges ahead, particularly in areas such as BEV perception, large models, diffusion models, and reinforcement learning [1]

Course Overview
- The course on E2E and VLA autonomous driving is designed to provide a comprehensive learning path from principles to practical applications, developed in collaboration with industry leaders [3]
- The course covers various aspects of E2E algorithms, including their historical development, the advantages and disadvantages of different paradigms, and current trends in both academia and industry [6][7]
- Key technical keywords expected to come up frequently in job interviews over the next two years are emphasized in the course content [7]

Course Structure
- Chapter 1 introduces the concept of E2E algorithms, discussing their evolution from modular approaches to current paradigms like VLA [6]
- Chapter 2 focuses on the background knowledge necessary for understanding E2E technologies, including VLA, large language models, diffusion models, and reinforcement learning [11]
- Chapter 3 delves into two-stage E2E algorithms, exploring their emergence and comparing them with one-stage approaches [7]
- Chapter 4 presents one-stage E2E algorithms and VLA, highlighting various subfields and their contributions to achieving the ultimate goals of E2E systems [8]
- Chapter 5 involves a practical assignment on RLHF (Reinforcement Learning from Human Feedback) fine-tuning, demonstrating how to build and experiment with pre-training and reinforcement-learning modules [9]

Learning Outcomes
- The course aims to elevate participants to the level of an E2E autonomous-driving algorithm engineer within approximately one year, covering a wide range of methodologies including one-stage, two-stage, world models, and diffusion models [15]
- Participants will gain a deeper understanding of key technologies such as BEV perception, multimodal large models, reinforcement learning, and diffusion models, enabling them to apply their knowledge in real-world projects [15]
Challenging GRPO: NVIDIA Proposes GDPO, Specialized for Multi-Reward Optimization
机器之心· 2026-01-11 04:00
机器之心 Editorial Team

GRPO is one of the foundational techniques behind DeepSeek-R1's success. Over the past year or two, GRPO and its variants have become widely adopted reinforcement-learning algorithms in industry thanks to their efficiency and simplicity.

But as language models grow more capable, user expectations are shifting: models should not only answer correctly, but also behave in ways that match diverse human preferences across different scenarios. Reinforcement-learning training pipelines have therefore begun to incorporate multiple reward signals, each corresponding to a different preference, to jointly steer the model toward desired behavior.

A new NVIDIA paper, however, argues that GRPO may not be the best choice for multi-reward optimization. Specifically, in multi-reward settings GRPO normalizes different reward combinations into the same advantage values, which weakens the training signal and lowers reward levels.

To address this, the authors propose a new policy-optimization method, Group reward-Decoupled normalization Policy Optimization (GDPO). By normalizing each reward signal separately, GDPO prevents different rewards from being blended and "flattened" together, more faithfully preserving their relative differences. This makes multi-reward optimization more accurate while markedly improving training stability.

Paper title: GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-re ...
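The normalization difference described above can be seen in a toy group of rollouts. The numbers and the two reward signals below are invented for illustration; this is a sketch of the idea, not the paper's implementation:

```python
import numpy as np

def normalize(x):
    # Group-wise standardization, as used for advantage estimation.
    return (x - x.mean()) / (x.std() + 1e-8)

# Four rollouts scored by two reward signals on very different scales:
# a 0/1 correctness reward and a 0-10 style/preference reward.
r_correct = np.array([1.0, 0.0, 1.0, 0.0])
r_style = np.array([0.0, 10.0, 10.0, 0.0])

# GRPO-style: combine rewards first, then normalize the sum.  The
# larger-scale style reward dominates the resulting advantage.
adv_grpo = normalize(r_correct + r_style)

# GDPO-style: normalize each reward within the group separately, then
# combine, so each signal keeps its relative differences regardless of
# its raw scale.
adv_gdpo = normalize(r_correct) + normalize(r_style)
```

Here rollout 0 is correct but plain while rollout 1 is wrong but stylish: the GRPO-style advantage ranks rollout 1 above rollout 0 because the style reward's raw scale dominates, whereas the decoupled version weights the two signals symmetrically and ranks those two rollouts equally.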
UC Berkeley's Dr. Allen Yang: Physical AI's Watershed Moment Has Yet to Arrive | CES 2026
钛媒体APP· 2026-01-10 14:33
Core Insights
- The artificial intelligence industry is currently engaged in a "GPU race," with a focus on cloud-based AI applications, but there is a call to shift attention toward physical AI and its potential breakthrough moments [1][5][16]
- Dr. Allen Yang emphasizes that while AI has made significant strides with models like AlphaGo, physical AI is still awaiting its own "watershed moment" due to unique challenges rooted in the complexities of the physical world [2][6][12]

Group 1: Challenges in Physical AI
- Physical AI lacks comprehensive training data for extreme scenarios, unlike language models that can leverage vast internet data [2][13]
- Real-time decision-making with millisecond-level latency is critical for applications like autonomous driving, where delays can lead to failures [2][14]
- Many cutting-edge scenarios lack reliable cloud connectivity, necessitating the use of edge AI deployed locally on devices [2][15]

Group 2: Innovations and Competitions
- The Berkeley AI Racing Team has achieved significant milestones, including a top speed of 163 miles per hour in autonomous driving competitions, showcasing the need for complex real-time perception and planning [4][18]
- The upcoming Tianmen Mountain humanoid robot challenge aims to test robots' mobility and decision-making in unstructured terrain, further pushing the boundaries of physical AI [4][29]
- The collaboration with nine universities in China for the Tianmen Mountain challenge highlights the importance of interdisciplinary cooperation and real-world experience in advancing physical AI [4][26]

Group 3: Future Directions
- The focus on physical AI is expected to grow, with the potential for new breakthroughs that could redefine the field, similar to past milestones in AI history [2][18][29]
- The upcoming competitions and challenges are designed to foster innovation and collaboration among institutions, aiming to discover the next "AlphaGo moment" in physical AI [25][29]
Shunyu Yao, Junyang Lin, and Zhilin Yang Convene: Sharp Takes on Large-Model Startups and the Next-Generation Technology Paradigm
第一财经· 2026-01-10 14:21
Core Viewpoint
- The article discusses the next generation of AI technology paradigms, particularly the concept of Autonomous Learning as a potential solution to the limitations of current large models and their reliance on labeled data and offline pre-training [3][4]

Group 1: Autonomous Learning
- Autonomous Learning is gaining traction as a method for large models to evolve independently by generating learning signals and optimizing through closed-loop iterations [3]
- The definition and understanding of Autonomous Learning vary among industry experts, indicating a need for context-specific applications [3]
- Current advancements in Autonomous Learning are seen as gradual improvements rather than revolutionary changes, with existing efficiency issues still to be addressed [3]

Group 2: Future Paradigms and Innovations
- Experts believe that OpenAI, despite its commercialization challenges, remains a strong candidate for leading the next paradigm shift in AI [4]
- The potential of Reinforcement Learning (RL) is still largely untapped, with the next generation of paradigms expected to emphasize "self-evolution" and "proactivity" [4]
- Concerns about safety arise with the introduction of proactivity in AI, necessitating the instillation of appropriate values and constraints [4]

Group 3: Market Dynamics and Competitive Landscape
- The probability of Chinese teams leading in AI innovation in the next three to five years is considered high, given their ability to quickly replicate and improve upon discovered technologies [5]
- Key challenges for China include breakthroughs in lithography technology, capacity, and software-ecosystem development [5]
- The maturity of the B2B market and the ability to compete internationally are critical for China's success in AI [5]
Shunyu Yao, Junyang Lin, and Zhilin Yang Convene: Sharp Takes on Large-Model Startups and the Next-Generation Technology Paradigm
第一财经· 2026-01-10 14:06
Core Insights
- The next generation of AI technology paradigms is expected to focus on Autonomous Learning, which allows models to evolve independently without heavy reliance on human-annotated data and offline pre-training [1][2]
- The potential for innovation in AI is seen as high in China, with the ability to quickly replicate and improve upon discoveries, contingent on breakthroughs in key technologies like lithography machines [3]

Group 1: Next Generation Paradigms
- Autonomous Learning is a trending concept that enables models to generate learning signals and optimize through closed-loop iterations, leading to continuous evolution [1]
- The definition and understanding of Autonomous Learning vary among experts, emphasizing its dependence on specific data and task contexts [1]
- Current advancements in AI, such as Claude's ability to self-improve by transforming 95% of its own code, indicate that self-learning is already occurring, albeit with efficiency limitations [1]

Group 2: Market Leaders and Innovations
- OpenAI is viewed as the most likely candidate to lead the next paradigm shift in AI, despite facing challenges in maintaining its innovative edge [2]
- The current Reinforcement Learning (RL) paradigm is still in its early stages, with significant potential yet to be realized, focusing on "autonomous evolution" and "proactivity" [2]
- The introduction of proactivity in AI raises new safety concerns, necessitating the instillation of appropriate values and constraints [2]

Group 3: China's Position in AI
- The probability of Chinese teams leading in AI innovation in the next three to five years is considered high, given their ability to quickly replicate and enhance discoveries [3]
- Key challenges for China include production capacity and software-ecosystem development, alongside the need for a more mature B2B market [3]
- Cultural and economic factors may hinder the willingness to pursue groundbreaking innovations in China [3]
An Agent "Overachiever" Is Born! It Automatically Attaches a Completion Report to Its Work, Making the Case with Just 1.5 Screenshots
量子位· 2026-01-10 03:07
Core Insights
- The article discusses SmartSnap, which transforms GUI agents from passive executors into proactive self-verifiers, enabling them to collect evidence while completing tasks [7][12]

Group 1: Challenges in Current AI Verification
- A significant challenge in LLM/VLM-driven agents is the uncertainty of task-completion quality after execution [2]
- Existing verification methods require complex manual checks and robust trajectory-level validation, which can be inefficient and contextually noisy [4][5]
- These methods depend on continuous observable feedback, which can fail due to environmental changes [6]

Group 2: SmartSnap Overview
- SmartSnap allows agents to actively collect and submit a "snapshot of evidence" while performing tasks, akin to a project completion report [8][9]
- The approach aims to reduce the verification burden on external validators by enabling agents to self-verify their actions [6][19]

Group 3: Key Innovations
- SmartSnap introduces a dual mission for agents: executing tasks and self-verifying their completion [11][12]
- The 3C principle (Completeness, Conciseness, Creativity) is established to ensure evidence quality without overwhelming validators [15]
- Training uses the GRPO algorithm with intrinsic reward shaping to enhance evidence quality while minimizing reward hacking [14]

Group 4: Performance Improvements
- SmartSnap has shown significant performance improvements across various models, with the highest increase reaching 26.08% [17]
- The average task now requires only 1.5 evidence snapshots, greatly reducing validation costs [18]
- Agents trained with SmartSnap demonstrate improved interaction efficiency, leading to fewer interaction rounds [18]

Group 5: Future Implications
- The emergence of SmartSnap signifies a shift from brute-force execution to cognitive collaboration in GUI agents, enhancing AI reliability and paving the way for large-scale, low-cost AI deployment [21]
- Future AI systems must not only be capable but also trustworthy, emphasizing the importance of self-verification capabilities [22]
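As a rough illustration of the kind of intrinsic reward shaping described above, the function below combines task success, snapshot quality, and a snapshot-budget penalty. The function name, the weights, and the budget are hypothetical; the paper's actual shaping under GRPO is more involved:

```python
def smartsnap_style_reward(task_success, evidence_scores,
                           quality_weight=0.3, budget=2, penalty_rate=0.1):
    """Toy shaped reward for an agent that must both finish its task and
    submit a small set of high-quality evidence snapshots.

    evidence_scores: per-snapshot quality in [0, 1], standing in for the
    3C criteria (completeness, conciseness, creativity).
    """
    # Average snapshot quality (guard against an empty evidence list).
    quality = sum(evidence_scores) / max(len(evidence_scores), 1)
    # Penalize exceeding the snapshot budget so validation stays cheap,
    # discouraging screenshot spam as a form of reward hacking.
    over_budget = max(0, len(evidence_scores) - budget)
    return float(task_success) + quality_weight * quality - penalty_rate * over_budget
```

Under this shaping, a successful task with two strong snapshots scores higher than the same task padded with extra low-value screenshots, which mirrors the reported outcome that agents converge to about 1.5 snapshots per task.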
Stanford's Latest Whole-Body Motion-Control Approach Generalizes Across Terrains!
具身智能之心· 2026-01-09 00:55
Core Insights
- The article discusses the challenges and advancements in humanoid-robot locomotion, emphasizing the need for multi-limb coordination to navigate complex environments effectively [2][3][5]

Research Background and Core Challenges
- Traditional humanoid-robot movement focuses primarily on legged locomotion, but real-world scenarios require the use of additional body parts for stability and support [2]
- The research identifies two main challenges in humanoid-robot locomotion: contact-rich motion planning and robust control in complex environments, and the need for flexible skill switching across different terrains [3][5]

Core Methodology
- A hierarchical framework combining physics-based keyframe animation and reinforcement learning is proposed, consisting of four main components: keyframe generation, policy training, skill selection, and hierarchical execution [4][5]

Keyframe Motion Generation
- The study utilizes a GUI tool based on the MuJoCo physics engine to create keyframe animations that encode human movement knowledge while addressing physical realism and manual tuning costs [7]
- Keyframes are open-loop by nature, necessitating reinforcement learning to develop adaptive motion-tracking strategies [8]

Motion Tracking Strategies
- Strategies are categorized into three types, ensuring seamless transitions between four standard postures (standing, crawling, prone, supine) [9]
- The reward function for training includes components for tracking accuracy, energy efficiency, and preventing premature termination of training [10]

Visual Skill Classifier
- The system employs a visual skill classifier to autonomously select appropriate movement skills based on environmental perception, categorizing skills into movement, transition, and terrain-specific skills [11]

Hierarchical Policy Execution
- The framework separates visual planning from low-level control, enhancing robustness and real-time responsiveness [12]

Experimental Validation
- Data collection involved real-world testing with a robot equipped with dual fisheye cameras, and the model was trained using a ResNet classifier to balance computational efficiency and geometric-feature capture [15]
- The system demonstrated zero-shot transfer success across various obstacle configurations, validating the effectiveness of the motion-tracking strategies [18][23]

Conclusion and Future Directions
- The research presents a hybrid framework of keyframes and reinforcement learning, achieving humanoid-robot mobility in complex terrain and demonstrating zero-shot transfer capabilities [28]
- Future work may focus on automating keyframe design, improving motion quality through advanced interpolation methods, and optimizing contact-dynamics modeling to enhance performance in contact-rich tasks [28]
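The reward structure described for policy training (tracking accuracy, energy efficiency, and a term discouraging premature termination) can be sketched as a simple weighted sum. The exponential tracking kernel, the coefficients, and the function name below are illustrative choices, not the paper's actual values:

```python
import numpy as np

def locomotion_reward(qpos, qpos_ref, qvel, fell,
                      w_track=1.0, w_energy=0.01, alive_bonus=0.2):
    # Tracking term: exponentiated squared error toward the keyframe
    # reference pose, so this component stays in (0, 1].
    track = np.exp(-w_track * np.sum((qpos - qpos_ref) ** 2))
    # Energy term: penalize large joint velocities for efficiency.
    energy = w_energy * np.sum(np.square(qvel))
    # Survival term: a per-step bonus that discourages falls and
    # premature episode termination.
    alive = 0.0 if fell else alive_bonus
    return float(track - energy + alive)
```

With this shape, a policy that matches the reference pose at low joint velocity while staying upright earns the maximum per-step reward, and any fall or deviation strictly reduces it.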