Galaxea Team Releases: A Large-Scale, High-Quality Open-World Dataset and the G0 Dual-System VLA Model
具身智能之心·2025-09-04 01:04

Core Insights
- The article presents the Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in real human living and working environments, addressing the scarcity of high-quality open-world robot data and the limited generalization of current models [3][5][6].

Dataset Overview
- The dataset comprises 500 hours of data and 100,000 demonstration trajectories, covering 150 task categories, 1,600 object types, and 58 manipulation skills, with sub-task instructions annotated at 2 Hz [8][12].
- Data was collected with the Galaxea R1 Lite mobile dual-arm robot, which has 23 degrees of freedom and is equipped with RGB cameras for global scene perception and fine-grained manipulation sensing [5][6].

Data Diversity and Coverage
- The dataset spans 11 physical sites and 50 unique scenarios, covering residential, retail, dining, and office environments, thereby avoiding the limitations of existing datasets confined to controlled laboratory settings [6][12].
- The task distribution balances basic actions and specialized skills, with residential scenes accounting for 50.8% of the data and office scenes for 33.2% [11][12].

G0 Dual-System Framework
- The G0 framework couples a "slow-thinking" vision-language model (G0-VLM) with a "fast-executing" vision-language-action model (G0-VLA), employing a three-stage training strategy to achieve complex task planning and precise execution [5][19].
- The training stages are cross-entity pre-training, single-entity pre-training, and task-specific fine-tuning, which together significantly improve model performance [21][30].

Model Performance Evaluation
- The G0-VLA model showed superior performance on benchmark tasks such as desktop organization and microwave operation, with G0-Full achieving the highest average task-progress scores [39][47].
- The study found that single-entity pre-training is essential for effective model adaptation, as cross-entity pre-training can cause negative transfer when the training and target robot embodiments differ substantially [39][46].

Key Findings
- The G0-VLM model outperformed mainstream vision-language models in instruction accuracy, reaching 83.3% on desktop organization and 78.2% on bed-making tasks, underscoring the importance of domain-specific fine-tuning [42][47].
- The dataset's design and the dual-system framework effectively address the challenges of real-world robot task execution, providing a robust foundation for future advances in embodied intelligence [17][19].
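The slow/fast coupling described above (a slow-thinking planner refreshing sub-task instructions while a fast policy emits low-level actions every control step) can be sketched in a few lines. This is a hypothetical illustration, not the G0 implementation: the `slow_planner` and `fast_policy` stubs, the fixed plan, and the replanning rate are all invented for clarity.

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    instruction: str   # natural-language sub-task, e.g. "grasp object"
    tick: int          # control step at which it was issued

def slow_planner(task: str, tick: int) -> PlanStep:
    # Stand-in for a vision-language planner (the role G0-VLM plays):
    # here we simply step through a fixed, illustrative plan.
    plan = ["locate object", "grasp object", "place object"]
    return PlanStep(plan[(tick // 10) % len(plan)], tick)

def fast_policy(instruction: str, tick: int) -> list[float]:
    # Stand-in for the fast vision-language-action policy (the role
    # G0-VLA plays): return a dummy 7-DoF joint command.
    return [0.0] * 7

def run_episode(task: str, steps: int = 30, replan_every: int = 10):
    # Slow loop: replan every `replan_every` ticks.
    # Fast loop: act on the latest instruction at every tick.
    actions, plan = [], None
    for tick in range(steps):
        if tick % replan_every == 0:
            plan = slow_planner(task, tick)
        actions.append((plan.instruction, fast_policy(plan.instruction, tick)))
    return actions

log = run_episode("tidy the desk")
```

The design point this mirrors is that the planner runs far less often than the controller, so expensive language-level reasoning never blocks the high-rate action loop.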
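The three-stage training recipe summarized above (cross-entity pre-training, then single-entity pre-training on the target robot, then task-specific fine-tuning) can be sketched as a simple schedule. All names here, including the `train` stub and the dataset labels, are illustrative assumptions, not the article's actual pipeline.

```python
def train(model: dict, dataset: str, stage: str) -> dict:
    # Stand-in for a real training loop: we only record the stage order
    # so the schedule itself is what the sketch demonstrates.
    model["history"].append((stage, dataset))
    return model

def three_stage_training(target_robot: str, target_task: str) -> dict:
    model = {"history": []}
    # Stage 1: cross-entity pre-training on heterogeneous robot data.
    model = train(model, "mixed-embodiment corpus", "cross-entity pre-training")
    # Stage 2: single-entity pre-training on the target robot's own data;
    # the article reports this step is essential, since skipping it can
    # leave negative transfer from mismatched embodiments uncorrected.
    model = train(model, f"{target_robot} open-world data", "single-entity pre-training")
    # Stage 3: task-specific fine-tuning on demonstrations of the target task.
    model = train(model, f"{target_task} demonstrations", "task-specific fine-tuning")
    return model

m = three_stage_training("Galaxea R1 Lite", "desktop organization")
```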