Galaxea Team Releases a Large-Scale, High-Quality Open-World Robot Dataset and the G0 Dual-System VLA Model

Core Insights
- The article presents the Galaxea Open-World Dataset, a large-scale and diverse collection of robot behaviors recorded in real human living and working environments. It targets two problems: the scarcity of high-quality open-world robot data and the weak generalization of existing models [2][5][6].

Dataset Overview
- The Galaxea Open-World Dataset is presented as the first large-scale robot behavior dataset collected in real-life scenarios. It addresses two limitations of existing datasets: collection confined to controlled environments and inconsistent robot embodiments [5][17].
- Data were collected with the Galaxea R1 Lite, a mobile dual-arm robot with 23 degrees of freedom, equipped with RGB cameras for global scene perception and for sensing fine-grained manipulation [8][6].
- The dataset contains 500 hours of data and 100,000 demonstration trajectories, covering 150 task categories, 1,600 object types, and 58 manipulation skills, with detailed sub-task instructions labeled at 2 Hz [8][12].

Model Framework
- The G0 dual-system framework couples a "slow thinking" vision-language model (G0-VLM) with a "fast execution" vision-language-action model (G0-VLA), and uses a three-stage training strategy to achieve complex task planning and precise execution [5][19].
- The three training stages are cross-embodiment pre-training, single-embodiment pre-training, and task-specific fine-tuning. Together they are designed to balance general knowledge against adaptation to a specific robot [21][27].

Performance Evaluation
- The G0-VLA model demonstrated superior performance on benchmark tasks such as desk organization, microwave operation, bed making, and block building. G0-VLM achieved an instruction accuracy of 78.2% on bed making and 83.3% on desk organization [42][47].
- The study found that single-embodiment pre-training is essential for effective model performance, because cross-embodiment pre-training can cause negative transfer when the training robots differ significantly from the target robot [39][46].

Key Findings
- The dataset's design emphasizes real-world adaptability and ease of use for model training, so that the collected data reflects the complexity of human environments [6][17].
- The G0 architecture is inspired by Kahneman's dual-system theory: System 2 (slow thinking) is responsible for planning, and System 1 (fast execution) handles real-time reactions, balancing deliberate planning against timely execution [19][21].
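
The dual-system split described above can be illustrated with a minimal sketch. This is not the G0 implementation: the class `DualSystemController`, the functions `toy_planner` and `toy_policy`, and the `plan_every` ratio are all hypothetical names and values chosen for illustration. The sketch only shows the control pattern, in which a slow planner refreshes the sub-task instruction occasionally while a fast policy acts on every tick.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DualSystemController:
    """Toy dual-system loop (hypothetical, not the G0 code).

    planner: slow "System 2" stand-in, maps (task, tick) -> sub-task instruction.
    policy:  fast "System 1" stand-in, maps (instruction, tick) -> action.
    plan_every: assumed number of fast ticks per planner call.
    """
    planner: Callable[[str, int], str]
    policy: Callable[[str, int], str]
    plan_every: int = 5

    def run(self, task: str, ticks: int) -> List[Tuple[int, str, str]]:
        trace = []
        instruction = ""
        for t in range(ticks):
            if t % self.plan_every == 0:
                # System 2: re-plan only every `plan_every` ticks.
                instruction = self.planner(task, t)
            # System 1: produce an action on every tick.
            action = self.policy(instruction, t)
            trace.append((t, instruction, action))
        return trace


def toy_planner(task: str, tick: int) -> str:
    # Stand-in for a vision-language model: picks a sub-task by elapsed ticks.
    steps = ["locate object", "grasp object", "place object"]
    return f"{task}: {steps[min(tick // 5, len(steps) - 1)]}"


def toy_policy(instruction: str, tick: int) -> str:
    # Stand-in for a vision-language-action model: returns a labeled action.
    return f"act[{tick}] for '{instruction}'"


controller = DualSystemController(toy_planner, toy_policy, plan_every=5)
trace = controller.run("tidy desk", ticks=12)
```

In this toy run the planner is called at ticks 0, 5, and 10, while the policy runs on all 12 ticks and reuses the most recent instruction in between. A real system would pass camera observations rather than a tick counter, and the two components would typically run asynchronously.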