Breaking the VLA Model Reasoning Bottleneck: GigaAI, the Institute of Automation of the Chinese Academy of Sciences (CASIA), and Tsinghua University Jointly Release VLA-R1, Reaching a 75% Real-World Execution Success Rate
机器人大讲堂 · 2025-11-04 09:07
Core Insights - The article discusses the significance of Vision-Language-Action (VLA) models in embodied artificial intelligence, highlighting their ability to generalize across tasks and environments so that robots can interact with the real world [1][3].

VLA Model Challenges - Existing VLA models face two main obstacles: a lack of step-by-step reasoning, which leads to failures when instructions are ambiguous, and the absence of systematic post-training reinforcement of reasoning [2].

Introduction of VLA-R1 - VLA-R1 is a newly proposed reasoning-enhanced VLA model developed by GigaAI, CASIA, and Tsinghua University that aims to bridge the gap between reasoning and execution through a structured framework [3].

VLA-CoT-13K Dataset - The research team built the VLA-CoT-13K dataset, 13,000 labeled samples that attach an explicit chain of thought to each task, spelling out the reasoning steps that lead to the action plan [5][7].

Reinforcement Learning Strategy - VLA-R1 adopts a post-training strategy of reinforcement learning from verifiable rewards, using the Group Relative Policy Optimization (GRPO) algorithm to improve training efficiency [9].

Reward Signals in Training - The model incorporates three verifiable reward signals:
- An area alignment reward that scores the accuracy of the predicted operation area [12].
- A trajectory consistency reward that evaluates the smoothness and plausibility of generated action trajectories [12].
- An output format reward that enforces structured, clear output, promoting a "think before act" pattern [12][13].

Performance Evaluation - VLA-R1 performed strongly across benchmarks, reaching an IoU of 36.51 on in-domain tasks, a 17.78% improvement over the best baseline model, while also holding up in out-of-domain scenarios [14][15].
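The article does not reproduce a VLA-CoT-13K sample, but an entry pairing an instruction with an explicit chain of thought and an action plan might look like the following sketch. All field names, values, and the file reference are illustrative assumptions, not the released schema:

```python
import json

# Hypothetical VLA-CoT-13K entry. Field names and values are guesses
# for illustration only, not the dataset's actual schema.
sample = {
    "image": "scene_000123.jpg",
    "instruction": "Put the red cup on the tray.",
    "chain_of_thought": [
        "Locate the red cup among the objects on the table.",
        "Identify the tray as the target placement region.",
        "Plan a grasp on the cup, then a lift-and-place motion toward the tray.",
    ],
    # Predicted operation area (pixel box) and a short 2D waypoint trajectory.
    "affordance_bbox": [412, 233, 486, 310],
    "trajectory": [[450, 270], [440, 200], [380, 150], [310, 140]],
}

print(json.dumps(sample, indent=2))
```

The key idea is that each sample makes the intermediate reasoning explicit rather than mapping instruction to action directly, which is what allows the later reward stage to verify the reasoning output.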
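GRPO, the optimization algorithm mentioned above, avoids a learned value critic: for each instruction it samples a group of responses, scores each with the verifiable rewards, and normalizes rewards within the group to obtain advantages. A minimal sketch of that normalization step (group size, epsilon, and the surrounding policy-gradient machinery are assumptions):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantage: a_i = (r_i - mean(group)) / (std(group) + eps).

    Each sampled response is scored against its own group, so no value
    network is needed to estimate a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts for the same instruction, scored by verifiable rewards.
advs = group_relative_advantages([0.9, 0.4, 0.6, 0.1])
```

Responses scoring above the group mean get positive advantages and are reinforced; those below are suppressed, which is what makes cheap, automatically checkable rewards sufficient for post-training.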
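The three verifiable reward signals can be sketched as simple scoring functions: IoU for area alignment, a pointwise-deviation term for trajectory consistency, and a template check for output format. The exact formulas, weights, and the `<think>/<answer>` template are assumptions, not VLA-R1's published definitions:

```python
import math
import re

def area_alignment_reward(pred_box, gt_box):
    """IoU between predicted and reference operation areas (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0

def trajectory_consistency_reward(pred, gt):
    """Map average waypoint deviation into (0, 1]; assumes equal-length
    2D waypoint lists (a real metric might use DTW or curvature terms)."""
    d = [math.dist(p, g) for p, g in zip(pred, gt)]
    return 1.0 / (1.0 + sum(d) / len(d))

def format_reward(text):
    """1.0 only if the model 'thinks before acting' in the assumed template."""
    pattern = r"(?s)<think>.+</think>\s*<answer>.+</answer>"
    return 1.0 if re.fullmatch(pattern, text.strip()) else 0.0
```

Each reward is checkable directly against ground truth or a regex, with no learned reward model, which is what "verifiable" means in this setting.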
Robustness in Simulation - In simulated environments, VLA-R1 averaged a 55% success rate on affordance perception tasks and 70% on trajectory execution tasks across different robot models [17].

Real-World Application - In real-world evaluations, VLA-R1 averaged 62.5% success on affordance perception and 75% on trajectory execution across challenging scenarios [19].

Future Directions - Future research will focus on adapting the model to more complex robotic platforms and optimizing the reward mechanism to enhance safety and robustness in real-world applications [20].