New Peking University work EvoVLA: drastically reducing robot hallucination, long-horizon success rate surges 10%
具身智能之心·2025-11-30 03:03

Core Viewpoint

The article discusses the emergence of EvoVLA, a self-evolving Vision-Language-Action (VLA) model developed by a team from Peking University, which addresses "stage hallucination" in existing VLA models during long-horizon tasks, significantly improving success rates and reducing hallucination rates [1][5][40].

Group 1: Problem Identification
- Embodied AI is on the verge of a breakthrough, but existing VLA models exhibit a critical weakness in long-horizon manipulation tasks, often leading to "cheating" behaviors [2].
- In long-horizon tasks, VLA models frequently experience "stage hallucination," mistakenly believing they have completed a task stage when they have not [3][4].

Group 2: Solution Overview
- The Peking University research team proposed the EvoVLA framework, which uses a self-supervised approach to enhance VLA model performance [5].
- EvoVLA incorporates three core modules that work in synergy to form a closed-loop self-supervised reinforcement learning system [10].

Group 3: Key Innovations
- Stage Alignment Reward (SAR): an innovative reward function that addresses hallucination by providing detailed semantic descriptions of task stages, generated with the Gemini model [11][13].
- Pose-Based Object Exploration (POE): a mechanism that shifts the focus from pixel prediction to the geometric relationship between objects and the robot's gripper, making exploration more efficient [17][19][21].
- Long-Horizon Memory: a context selection mechanism that retrieves the most relevant historical information, preventing catastrophic forgetting during complex tasks [22][23][25].

Group 4: Benchmarking and Results
- The team introduced the Discoverse-L benchmark, which includes three progressively challenging tasks (Stack, Jujube-Cup, and Block Bridge) to validate long-horizon capability [26][27][28][29].
- EvoVLA achieved an average success rate of 69.2% on the Discoverse-L benchmark, surpassing the previous best model, OpenVLA-OFT, by 10.2 percentage points [34].
- In real-world experiments, EvoVLA demonstrated strong Sim2Real generalization, reaching a 55.2% success rate on a novel stacking-and-insertion task and outperforming OpenVLA-OFT by 13.4 percentage points [37].

Group 5: Conclusion
- EvoVLA offers an elegant solution to the reliability issues VLA models face in long-horizon tasks, showcasing the potential of improved reward design, exploration mechanisms, and memory strategies for advancing embodied AI [40][41].
- The self-evolving paradigm, which uses large language models to generate "error sets" for policy learning, may be a crucial step toward autonomous learning in general-purpose robotics [42].
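The Stage Alignment Reward described above can be sketched in a minimal form: compare the current observation embedding against per-stage text embeddings (such as those generated from Gemini stage descriptions) and reward the policy only when the observation actually matches the stage it claims to have reached. Every function name and the cosine-similarity formulation here are assumptions for illustration, not the paper's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def stage_alignment_reward(obs_embedding, stage_embeddings, claimed_stage):
    """Hypothetical sketch of a Stage Alignment Reward (SAR).

    Rewards the policy when the observation embedding truly matches the
    stage it claims to have reached; penalizes hallucinated progress.
    """
    sims = [cosine(obs_embedding, s) for s in stage_embeddings]
    best_stage = max(range(len(sims)), key=sims.__getitem__)
    aligned = best_stage == claimed_stage
    # Positive reward only for true alignment; flat penalty otherwise.
    return (sims[claimed_stage] if aligned else -0.5), aligned
```

The key design point this sketch illustrates is that progress is verified against semantic descriptions of each stage rather than taken from the policy's own belief, which is what suppresses "stage hallucination."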
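Pose-Based Object Exploration, as summarized above, replaces pixel-level novelty with novelty in the gripper-object geometric relationship. A count-based sketch of that idea follows; the 3-D offset representation, the radius, and the bonus schedule are all simplifying assumptions (the actual method presumably also considers orientation).

```python
import math

def pose_exploration_bonus(visited_poses, rel_pose, radius=0.05):
    """Hypothetical sketch of Pose-Based Object Exploration (POE).

    Grants an intrinsic bonus when the relative gripper-to-object pose
    lands in a region of pose space rarely visited before.
    rel_pose: (dx, dy, dz) gripper-to-object offset (a simplification).
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Count previously visited poses within `radius` of the current one.
    visits = sum(1 for p in visited_poses if dist(p, rel_pose) < radius)
    visited_poses.append(rel_pose)
    # Bonus decays with visit count: novel configurations pay more.
    return 1.0 / math.sqrt(visits + 1)
```

Because the novelty signal lives in low-dimensional pose space rather than pixel space, revisiting a known geometric configuration under new lighting or textures earns no extra bonus, which is why this style of exploration tends to be more sample-efficient.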
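The Long-Horizon Memory module's context selection can likewise be sketched as similarity-based retrieval: instead of conditioning on an ever-growing history, keep only the k past frames most relevant to the current observation. The function name, the (embedding, frame) storage format, and the similarity metric are assumptions for illustration.

```python
import heapq
import math

def select_context(history, query, k=3):
    """Hypothetical sketch of long-horizon context selection.

    history: list of (embedding, frame) pairs from earlier timesteps.
    Returns the k frames most similar to the current query embedding,
    in their original temporal order.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + 1e-8)
    scored = ((cos(emb, query), i, frame) for i, (emb, frame) in enumerate(history))
    top = heapq.nlargest(k, scored)
    # Re-sort winners by timestep so sequence structure is preserved.
    return [frame for _, i, frame in sorted(top, key=lambda t: t[1])]
```

Bounding the context to the top-k relevant frames is one plausible way to get the anti-forgetting effect the article describes: distant but task-relevant events stay retrievable without the context window growing with episode length.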
