University of California! EgoVLA: Learning VLA Models from Egocentric Human Videos
具身智能之心· 2025-07-20 01:06
Core Insights
- The article discusses a novel approach to robot learning that leverages egocentric human video to train Vision-Language-Action (VLA) models, sidestepping the cost and scale limits of traditional robot data collection [3][21].

Research Background and Core Ideas
- Traditional robot learning relies heavily on large-scale real-robot data, whose collection is constrained by hardware and operational costs. Human activity in everyday environments, by contrast, offers a vast pool of potential training data, since billions of people continuously perform the very tasks robots are expected to handle [3].
- The key breakthrough is approximating the action-space gap between humans and robots with geometric transformations, so the VLA model can be trained on human video first and then fine-tuned with a small amount of robot demonstrations, enabling skill transfer [3] (a retargeting sketch follows this summary).

Model Architecture and Action Space Design
- The framework is built on NVILA-2B, leveraging its vision-language understanding for intent reasoning and efficient fine-tuning. Inputs include current and historical first-person visual observations, a language instruction, action query tokens, and human proprioception [5] (see the action-head sketch below).
- The action space combines human wrist poses with the first 15 PCA components of the MANO hand model, balancing compactness and expressiveness when transferring actions from humans to robots [8] (see the PCA sketch below).

Training and Evaluation
- A dataset of roughly 500,000 image-action pairs was assembled from four sources, covering a variety of rigid objects and annotated with RGB observations, wrist poses, hand poses, and camera poses [12] (see the sample-layout sketch below).
- The Ego Humanoid Manipulation Benchmark was established for unified evaluation of humanoid robot manipulation, comprising 12 tasks and addressing data-balance issues [14].

Experimental Results and Key Findings
- Human pre-training substantially improves core performance: on fine manipulation tasks, EgoVLA's success rate is about 20% higher than that of a model trained without pre-training [16][20].
- The model remains robust across visual configurations, with only a slight drop in success rate on unseen visual backgrounds, indicating adaptability to new environments [20].

Impact of Data Scale and Diversity
- Greater diversity in the human data correlates with better generalization: the model pre-trained on the combined datasets outperforms those trained on any single dataset in short-horizon tasks [23].
- Performance drops when the model relies solely on robot demonstration data, underscoring that human pre-training needs to be paired with a certain amount of robot data for the best results [23].
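To make the idea of bridging human and robot action spaces more concrete, here is a minimal retargeting sketch: a fixed rotation and offset maps a wrist pose estimated from egocentric video into a robot end-effector target. The transform values and function names are illustrative assumptions, not the paper's actual calibration.

```python
import numpy as np

# Minimal retargeting sketch: a fixed geometric transform maps a human
# wrist pose (from egocentric video) into the robot end-effector frame.
# The rotation/offset values below are placeholders, not real calibration.
R_HUMAN_TO_ROBOT = np.eye(3)                    # assumed rotation
T_HUMAN_TO_ROBOT = np.array([0.0, 0.0, 0.05])   # assumed offset in meters

def retarget_wrist(wrist_pos, wrist_rot):
    """Map a human wrist pose (position + 3x3 rotation) to a robot target."""
    robot_pos = R_HUMAN_TO_ROBOT @ wrist_pos + T_HUMAN_TO_ROBOT
    robot_rot = R_HUMAN_TO_ROBOT @ wrist_rot
    return robot_pos, robot_rot

# Example: a wrist 30 cm in front of the camera, identity orientation.
pos, rot = retarget_wrist(np.array([0.0, 0.0, 0.3]), np.eye(3))
```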
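The architecture bullet above describes a vision-language backbone queried with action tokens. The sketch below shows one plausible way to attach an action head that reads those tokens' hidden states and regresses wrist and hand actions; the hidden size, token layout, and action dimensions are assumptions, not the released NVILA-2B implementation.

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Hypothetical head mapping action-query hidden states to actions
    (two wrists x 7-D pose plus two hands x 15 PCA coefficients)."""

    def __init__(self, hidden_dim=2048, action_dim=2 * (7 + 15)):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 512),
            nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, query_hidden_states):
        # query_hidden_states: (batch, num_action_queries, hidden_dim)
        return self.mlp(query_hidden_states)  # (batch, num_queries, action_dim)

# Usage: 8 action query tokens produce 8 future action steps.
head = ActionHead()
actions = head(torch.zeros(1, 8, 2048))  # shape (1, 8, 44)
```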
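The action-space bullet mentions keeping only the first 15 PCA components of the MANO hand model. The sketch below shows the general encode/decode pattern on the 45-dimensional MANO finger articulation; it fits PCA on random stand-in data, whereas in practice the basis would come from MANO's own components or the collected hand-pose annotations.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: MANO articulates 15 finger joints with 3 axis-angle
# parameters each, i.e. a 45-D pose vector per hand.
rng = np.random.default_rng(0)
hand_poses = rng.normal(size=(10_000, 45))

# Keep only the first 15 principal components as the compact hand action.
pca = PCA(n_components=15)
pca.fit(hand_poses)

def encode_hand(pose_45d):
    """45-D MANO articulation -> 15-D compact action."""
    return pca.transform(pose_45d.reshape(1, -1))[0]

def decode_hand(action_15d):
    """15-D compact action -> approximate 45-D articulation."""
    return pca.inverse_transform(action_15d.reshape(1, -1))[0]
```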
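Finally, for the roughly 500,000 image-action pairs, one way to picture a single annotated sample is the structure below. The field names and shapes are hypothetical, chosen only to reflect the annotation types listed above (RGB observation, wrist pose, hand pose, camera pose, language instruction).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EgoSample:
    rgb: np.ndarray          # (H, W, 3) first-person RGB frame
    instruction: str         # language description of the task
    wrist_pose: np.ndarray   # (2, 7) left/right wrist position + quaternion
    hand_pose: np.ndarray    # (2, 15) per-hand MANO PCA coefficients
    camera_pose: np.ndarray  # (7,) camera position + quaternion in world frame
```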