Mean Velocity Field
ICLR 2026 Oral | Goodbye to Multi-Step Denoising! Tsinghua Team Introduces MVP for Ultra-Fast One-Step Robot Action Generation
机器之心 (Jiqizhixin) · 2026-03-16 10:23
Core Insights
- The article covers Mean Velocity Policy (MVP), a research result that improves both the efficiency and the quality of generative reinforcement learning. MVP generates actions in a single step while retaining the expressiveness of generative policies [4][9][26].

Background
- Generative reinforcement learning faces efficiency and quality bottlenecks, particularly in real-time control, where optimal actions often follow multimodal distributions. Traditional methods incur high inference latency because they rely on iterative denoising [5][6].

Key Contributions
- MVP combines the high expressiveness of generative policies with the time efficiency of one-step action generation, addressing the limitations of traditional methods [9][26].

Technical Innovations
- An Instantaneous Velocity Constraint (IVC) anchors the mean flow policy, providing a unique boundary condition that improves the precision and stability of policy fitting [12][14].
- A Generate-and-Select mechanism efficiently generates candidate actions and selects among them, so that the policy keeps improving throughout reinforcement learning [16][18].

Experimental Results
- MVP achieved state-of-the-art (SOTA) performance across tasks in the Robomimic and OGBench benchmarks, with faster online convergence and better final performance, particularly on complex tasks [20][21].
- MVP is also more computationally efficient: its online training throughput is over 50% higher than that of traditional methods that need multiple denoising steps [27].

Summary and Outlook
- The research targets the slow sampling and high inference latency of generative reinforcement learning. The proposed MVP framework generates actions in one step without distillation, which points to a new paradigm for embodied intelligent systems that need extreme responsiveness [26].
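
The Generate-and-Select mechanism described under Technical Innovations can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the MVP paper: `mean_velocity` is a hand-written toy stand-in for a learned mean velocity network, `q_value` is a hypothetical critic, and the update `action = noise + mean_velocity` assumes a unit-length flow from noise to action. The sketch only shows the general pattern of drawing several candidates with one function evaluation each and keeping the highest-scoring one.

```python
import random


def mean_velocity(state, noise):
    """Toy stand-in (assumption) for a learned mean velocity network.

    It moves each noise coordinate to one of two action modes, +1 or -1,
    shifted slightly by the state, so the resulting policy is multimodal.
    """
    return [(1.0 if n >= 0 else -1.0) + 0.1 * s - n for s, n in zip(state, noise)]


def one_step_action(state, rng):
    """Generate one action with a single evaluation of mean_velocity."""
    noise = [rng.gauss(0.0, 1.0) for _ in state]
    velocity = mean_velocity(state, noise)
    # Assumed convention: displacement over a unit interval equals the mean velocity.
    return [n + v for n, v in zip(noise, velocity)]


def q_value(state, action):
    """Hypothetical critic (assumption): prefers actions close to +1 everywhere."""
    return -sum((a - 1.0) ** 2 for a in action)


def generate_and_select(state, num_candidates=16, seed=0):
    """Draw candidate actions in one step each, then keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [one_step_action(state, rng) for _ in range(num_candidates)]
    scores = [q_value(state, a) for a in candidates]
    best_index = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best_index], scores[best_index], candidates, scores


if __name__ == "__main__":
    best, best_score, _, _ = generate_and_select([0.0, 0.0], num_candidates=8)
    print("selected action:", best, "score:", best_score)
```

Because each candidate costs one function evaluation, the cost of selection grows linearly with the number of candidates, whereas an iterative denoiser would multiply that cost by its number of denoising steps.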