Core Insights
- The article presents the Self-Referential Policy Optimization (SRPO) framework, which improves the performance of Vision-Language-Action (VLA) models on robotic tasks by addressing sparse rewards and dependence on expert demonstrations [3][11].

Motivation and Contribution
- Recent work shows that reinforcement learning (RL) can substantially improve VLA models both within and outside their training distribution. Sparse reward signals remain a bottleneck, however: in VLA tasks, high computational cost and poor use of information from failed trajectories limit training efficiency [6][11].
- SRPO reduces dependence on expert demonstrations and task-specific reward engineering by using self-generated successful trajectories to assign progressive rewards to failed attempts [11][12].

Technical Approach
- SRPO follows a "learn from success" paradigm: trajectories generated during policy rollouts are collected and split into successful and failed attempts, and a latent world representation is used to measure behavioral similarity and compute progressive rewards (see the sketches after this summary) [14][16].
- The framework formalizes robotic decision making as a partially observable Markov decision process (POMDP) and introduces a world-model-driven reward modeling mechanism that supplies progressive reward signals for failed trajectories [18][19].

Experimental Results
- SRPO reached a 99.2% success rate with only 200 reinforcement learning steps, clearly outperforming baselines that rely on sparse rewards or hand-designed rewards [27].
- On the LIBERO-Plus generalization benchmark, SRPO improved performance by 167% without training on any of the generalization scenarios [30].

Efficiency and Real-World Application
- On long-horizon tasks, SRPO raised success rates from 17.3% to 98.6% with few training steps, indicating far better information utilization than conventional methods [34].
- SRPO's reward modeling has also been validated in real-world environments, yielding substantial success-rate gains across a range of tasks [37].

Conclusion
- SRPO marks a significant advance in VLA reinforcement learning, enabling robots to move from imitation to autonomous exploration without costly data labeling or complex reward design [51].
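The summary describes the POMDP formalization and the world-model-driven reward only at a high level, so the paper's exact definitions are not reproduced here. The display below is a hedged sketch of a progressive reward of the described shape; the symbols z_t, Z^+, lambda, and sim are illustrative notation, not SRPO's own.

```latex
% Standard POMDP tuple assumed for the decision process:
%   (S, A, \Omega, T, O, r, \gamma)
% One plausible reading of the progressive reward for a failed rollout,
% where z_t is the latent (world-model) state at step t and
% \mathcal{Z}^{+} is the pool of latents from self-generated successful rollouts:
\[
\tilde{r}_t \;=\; r_t \;+\; \lambda \,\max_{z^{+} \in \mathcal{Z}^{+}} \operatorname{sim}\!\left(z_t,\, z^{+}\right),
\qquad
\operatorname{sim}(z, z') \;=\; \frac{z^{\top} z'}{\lVert z \rVert\, \lVert z' \rVert}.
\]
```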
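The summary also does not specify how behavioral similarity is computed in practice. Below is a minimal Python sketch, assuming cosine similarity between the per-step latents of a failed rollout and a pool of latents from self-generated successful rollouts; the function name `progressive_rewards`, the `temperature` parameter, and the array shapes are assumptions for illustration, not SRPO's actual interface.

```python
import numpy as np

def progressive_rewards(failed_latents, success_latents, temperature=1.0, eps=1e-8):
    """Hypothetical sketch of self-referential reward shaping.

    Each state of a failed trajectory is scored by its maximum cosine
    similarity to states from successful trajectories, all compared in a
    shared latent (world-model) space.

    failed_latents:  (T_f, D) latent states of one failed rollout
    success_latents: (T_s, D) latent states pooled from successful rollouts
    """
    # Normalize so dot products become cosine similarities.
    f = failed_latents / (np.linalg.norm(failed_latents, axis=1, keepdims=True) + eps)
    s = success_latents / (np.linalg.norm(success_latents, axis=1, keepdims=True) + eps)

    # For every failed step, find how close it gets to any successful state.
    sim = f @ s.T                 # (T_f, T_s) similarity matrix
    progress = sim.max(axis=1)    # best match per failed step

    # Squash into (0, 1) so the signal stays dense but bounded; failed
    # trajectories that nearly reached success earn higher shaped reward.
    return 1.0 / (1.0 + np.exp(-progress / temperature))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    success = rng.normal(size=(50, 16))   # latents from successful rollouts
    failure = rng.normal(size=(30, 16))   # latents from one failed rollout
    r = progressive_rewards(failure, success)
    print(r.shape, float(r.min()), float(r.max()))  # dense per-step rewards in (0, 1)
```

In an actual training pipeline, such shaped rewards would augment or replace the sparse task reward when updating the VLA policy; the example above only demonstrates the shape and range of the resulting dense per-step signal.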
Goodbye to expert dependence: robots learn self-reference, with success rates soaring to 99.2% in just 200 steps
机器之心·2025-12-10 05:10