告别专家依赖，让机器人学会自我参考，仅需200步性能飙升至99.2%

Core Insights - The article discusses the development of the Self-Referential Policy Optimization (SRPO) framework, which addresses the limitations of existing Visual Language Action (VLA) models in robotic tasks by enabling robots to learn from their own experiences without relying on external expert data [3][10][56]. Motivation and Contribution - SRPO aims to overcome the challenges of sparse reward signals in reinforcement learning, particularly in the VLA domain, by utilizing self-generated successful trajectories to provide progressive rewards for failed attempts [6][10]. - The framework eliminates the need for costly expert demonstrations and task-specific reward engineering, thus enhancing the efficiency of the learning process [10][12]. Technical Approach - SRPO collects trajectories generated during policy inference and categorizes them into successful and failed attempts, using a potential world representation to model behavior similarity [16][17]. - The framework employs a progressive reward mechanism based on the distance of failed trajectories to successful trajectory representations, allowing for a more nuanced evaluation of task progress [22][24]. Experimental Results - SRPO achieved a success rate of 99.2% in the LIBERO benchmark with only 200 steps of reinforcement learning, significantly outperforming traditional methods that rely on sparse rewards [29][30]. - In the LIBERO-Plus generalization tests, SRPO demonstrated a performance improvement of 167%, showcasing its robust generalization capabilities without the need for additional training data [31][32]. Efficiency and Real-World Application - The efficiency of SRPO is highlighted by its ability to improve success rates from 17.3% to 98.6% in long-term tasks with minimal training steps, outperforming other models in terms of training efficiency [36][39]. - The framework has been tested in real-world scenarios, showing significant improvements in success rates compared to supervised fine-tuning baselines [41][39]. Conclusion - SRPO represents a significant advancement in robotic learning, allowing for autonomous exploration and creativity by enabling robots to learn from their own successes and failures, thus paving the way for a new approach in VLA reinforcement learning [56].