ICLR 2026 | Robots that evolve in "imagination": HKUST × ByteDance Seed propose WMPO, reinforcement learning for VLA inside a world model
机器之心 · 2026-03-02 03:06

Core Insights
- The article covers WMPO (World Model-based Policy Optimization), developed by the Hong Kong University of Science and Technology PEI-Lab and the ByteDance Seed team, which lets embodied agents train in "imagination" instead of through extensive real-world reinforcement learning interaction [2][3].

Group 1: Traditional VLA Training Limitations
- Traditional Vision-Language-Action (VLA) models face two main bottlenecks: the inherent limits of imitation learning and the high cost of real-world reinforcement learning [3][4].
- Imitation learning mainly teaches models "what the correct action is" but not "what to do after a mistake," so errors compound once the robot drifts into slightly deviated states [4].
- Real-world reinforcement learning requires millions of attempts, which means low sample efficiency, hardware wear, safety risks, and high experimental cost [5].

Group 2: WMPO's Core Breakthroughs
- WMPO introduces a new training paradigm that moves policy optimization entirely into a visual world model, letting embodied agents learn to recover from errors within "imagined" trajectories [8].
- The method uses a pixel-level visual world model that simulates failures realistically; a Policy Behavior Alignment mechanism improves its ability to predict the outcomes of out-of-distribution (OOD) actions [8][14].
- WMPO runs online Group Relative Policy Optimization (GRPO) in this imagined space, generating multiple candidate trajectories from the same initial state and scoring their success with a trained reward function (a rollout sketch and a GRPO sketch follow this summary) [9][15].

Group 3: Addressing Long-Term Generation Challenges
- WMPO tackles long-horizon video prediction by keeping imagined frames sharp and action-consistent over hundreds of steps, providing a stable training environment for policy optimization [10].
- Noisy-frame conditioning and a frame-level action-control mechanism are introduced to maintain the quality of the generated trajectories (see the conditioning sketch after this summary) [10].

Group 4: WMPO Architecture and Learning Objectives
- WMPO's architecture rests on high-fidelity visual world modeling: it predicts the next frame directly from the current observation and action, with no abstract latent-space prediction [12].
- The learning objective is self-supervised parameter optimization, turning the VLA model from a mere imitator into a self-improving decision-maker [20].

Group 5: Experimental Results
- WMPO is markedly more sample-efficient: with only 128 real trajectories its success rate exceeds the best offline RL baseline by 9.8%, and the advantage grows to 15.2% with 1,280 trajectories [23].
- Self-correcting behaviors emerged: in several tasks the trained model adjusted its actions after collisions or misalignments, showing that it can learn from imagined failures [24].
- WMPO-trained policies also execute more efficiently, with more coherent, decisive actions and shorter successful trajectories [26].

Group 6: Implications and Future Directions
- WMPO's results suggest that high-quality "imagination" can substitute for costly real-world "practice," easing the sample-efficiency problem while letting robots learn to improve through setbacks [28].
- The approach points to a promising path toward generalizable embodied intelligence, as highlighted by the Da Vinci quote, "Simplicity is the ultimate sophistication" [29].
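To make "policy optimization entirely inside the world model" concrete, below is a minimal Python sketch of an imagined rollout, assuming the world model behaves as the pixel-level next-frame predictor described in Group 4. All names here (`policy`, `world_model`, `reward_fn`) are hypothetical stand-ins for illustration, not the authors' API.

```python
def imagined_rollout(policy, world_model, reward_fn, obs0, instruction, horizon=100):
    """Roll out a trajectory entirely in 'imagination': the VLA policy
    chooses an action from the current (predicted) frame, and the
    pixel-level world model predicts the next frame from (frame, action)
    with no abstract latent bottleneck. No real robot is involved."""
    obs, frames, actions = obs0, [obs0], []
    for _ in range(horizon):
        action = policy(obs, instruction)   # VLA policy: pixels + language in, action out
        obs = world_model(obs, action)      # next-frame prediction
        frames.append(obs)
        actions.append(action)
    return frames, actions, reward_fn(frames)  # trained reward head scores success
```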
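The group-relative advantage at the core of GRPO fits in a few lines: all trajectories imagined from the same initial state are scored against one another, so no learned value baseline is needed. A minimal sketch, assuming binary success scores from the trained reward function:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each trajectory's reward
    against the mean and std of its own rollout group, so trajectories
    from the same initial state compete directly with each other."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 imagined rollouts from one initial state, 3 judged successful.
# Successes get positive advantages (reinforced), failures negative.
print(grpo_advantages([1, 0, 0, 1, 0, 1, 0, 0]))
```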
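Noisy-frame conditioning can be pictured as deliberately corrupting the context frames during world-model training, so that at rollout time the model stays stable when it must condition on its own imperfect generations rather than ground truth. The shape of the training step below is an illustrative assumption, not the paper's exact objective:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_frame_conditioning(context_frames, noise_scale=0.1):
    """Perturb conditioning frames (pixel values in [0, 1]) so the world
    model learns to tolerate the drift it will see in its own rollouts."""
    noise = rng.normal(0.0, noise_scale, size=context_frames.shape)
    return np.clip(context_frames + noise, 0.0, 1.0)

def training_step(world_model, frames, actions, t, loss_fn):
    """Predict ground-truth frame t from the noised context frames plus
    the frame-level action (the action-control mechanism), then regress."""
    context = noisy_frame_conditioning(frames[:t])
    prediction = world_model(context, actions[t - 1])
    return loss_fn(prediction, frames[t])
```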