SOTA even with scarce data? Tsinghua & Shanghai AI Lab crack two major bottlenecks in robot RL
具身智能之心·2025-09-27 01:33

Core Insights
- The article introduces SimpleVLA-RL, a new reinforcement-learning framework designed to improve the training efficiency and generalization of Vision-Language-Action (VLA) models for robotics, addressing key limitations of existing training paradigms [4][14].

Group 1: Key Contributions of SimpleVLA-RL
- SimpleVLA-RL tackles three major bottlenecks in VLA model training: high data-collection costs, insufficient generalization ability, and reliance on large-scale demonstration data [6][11].
- The framework reaches state-of-the-art (SoTA) performance on standard benchmarks such as LIBERO and RoboTwin, delivering large gains in success rate even with limited data [6][21].
- With only a single demonstration per task, OpenVLA-OFT's average success rate on LIBERO rose from 48.9% to 96.9%, and on long-horizon tasks from 17.3% to 91.7% [6][21].

Group 2: Training Mechanism and Innovations
- The training pipeline combines interactive trajectory sampling, outcome-based reward modeling, and exploration enhancement, which together improve data efficiency and model performance [15][16][17].
- The outcome reward model reduces the reward signal to a binary result (success or failure), keeping optimization focused on the task objective and avoiding the complexity of designing process rewards [16][21]; a minimal code sketch of this idea appears after this summary.
- The exploration-enhancement strategy encourages diverse rollouts during training, preventing the policy from collapsing onto narrow solutions [17][19].

Group 3: Performance Metrics and Benchmark Results
- SimpleVLA-RL achieves an average success rate of 99.1% on the LIBERO benchmark, with long-horizon tasks improving by 12.0 percentage points [23].
- On RoboTwin 1.0, the average success rate rose from 39.8% to 70.4%, with notable gains on individual tasks such as "Blocks Stack," which improved by 33.1 percentage points [25].
- On RoboTwin 2.0, the average success rate rose from 38.3% to 68.8%, surpassing previous models [27].

Group 4: Real-World Application and Generalization
- A model trained solely on simulation data transferred better to real-world tasks, with average success rates increasing from 17.5% to 38.5% in physical deployments [30].
- The emergence of the "Pushcut" phenomenon shows that the model can autonomously discover strategies beyond those seen in human demonstrations, highlighting its capacity for adaptive learning [32][34].
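As a rough illustration of the outcome-reward idea referenced in Group 2, the sketch below converts binary success/failure rollout outcomes into group-normalized advantages, in the spirit of group-relative policy optimization. The function name, the group size of 8, and the normalization details are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

def outcome_reward_advantages(successes, eps=1e-8):
    """Turn binary rollout outcomes into group-relative advantages.

    `successes` holds 0/1 outcomes for several rollouts of the same task
    (1 = the episode completed the task, 0 = it failed). Each rollout's
    sparse outcome reward is normalized against the group mean and
    standard deviation, so successful trajectories receive positive
    advantages and failed ones negative, with no process reward needed.
    """
    r = np.asarray(successes, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical usage: 8 rollouts of one LIBERO-style task, 3 of which succeed.
adv = outcome_reward_advantages([1, 0, 0, 1, 0, 1, 0, 0])
print(adv)  # positive values for the 3 successes, negative for the 5 failures
```

In practice, such advantages would weight a policy-gradient update on the VLA policy's action outputs, and exploration enhancement could be approximated by sampling rollouts at a higher temperature; both details here are assumptions rather than a description of the released code.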