Group 1
- The core viewpoint of the article is the introduction of the GLaD framework, which integrates 3D geometric priors into Vision-Language-Action (VLA) models to improve their performance on robotic control tasks without requiring additional depth sensors or 3D annotations [2][4][28]
- Existing VLA models rely primarily on 2D visual encoders, which limits their grasp of 3D spatial information and leads to inaccuracies in task execution [2][4]
- GLaD's architecture consists of a geometric distillation module and a staged training strategy, enabling effective integration of geometric knowledge into the VLA model [7][10]

Group 2
- The geometric distillation module is GLaD's core innovation: it aligns the hidden states of the LLM's visual tokens with features from a geometric-perception teacher model, achieving deep integration of geometric knowledge [9][10]
- Training proceeds in two phases: the first performs geometric-distillation pre-training on the Bridge dataset; the second fine-tunes the model on downstream tasks such as LIBERO [12][13]
- GLaD achieves an average success rate of 94.1% on the LIBERO benchmark, outperforming baseline models such as UniVLA and OpenVLA [14][16]

Group 3
- The LIBERO benchmark consists of 130 language-conditioned manipulation tasks divided into four suites, assessing aspects of model performance including spatial knowledge transfer and long-horizon task capability [17][19]
- GLaD shows strong robustness under object perturbations, reaching an 81% success rate on the GOAL suite versus 62% for UniVLA [16][19]
- Ablation studies confirm GLaD's key design choices, showing that late-stage alignment of the LLM's final layer significantly improves task performance [20][26]

Group 4
- The article highlights the core value of geometric understanding: GLaD's ability to focus on task-relevant objects is a key factor behind its high success rates [23][25]
- Choosing the VGGT geometric encoder over alternative encoders yields a 29.8-percentage-point improvement on the SPATIAL suite, demonstrating its suitability for spatial reasoning tasks [25][26]
- Future directions include more precise spatial-relationship modeling to address current limitations in spatial-layout generalization [27][28]
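The geometric distillation described in Group 2 can be sketched as a feature-alignment objective: project the LLM's visual-token hidden states into the teacher's feature space and penalize their angular distance from the frozen geometric teacher's features (e.g. a VGGT-style encoder). The projection matrix `W`, the cosine form of the loss, and all shapes below are assumptions for illustration, not details confirmed by the article.

```python
import numpy as np

def distill_loss(student_hidden, teacher_feat, W):
    """Hypothetical geometric-distillation objective (a sketch, not GLaD's
    exact loss): project the LLM's visual-token hidden states into the
    teacher's feature space and average (1 - cosine similarity) per token.

    student_hidden: (T, d_llm) hidden states of the T visual tokens
    teacher_feat:   (T, d_geo) features from a frozen geometric teacher
    W:              (d_llm, d_geo) learnable projection head (assumed)
    """
    proj = student_hidden @ W                                        # (T, d_geo)
    proj_n = proj / np.linalg.norm(proj, axis=1, keepdims=True)      # unit vectors
    teach_n = teacher_feat / np.linalg.norm(teacher_feat, axis=1, keepdims=True)
    cos = np.sum(proj_n * teach_n, axis=1)                           # per-token cosine
    return float(np.mean(1.0 - cos))                                 # 0 when aligned

# sanity check: identical features under an identity projection give zero loss
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
loss = distill_loss(feats, feats, np.eye(8))  # -> 0.0
```

A cosine objective only constrains feature *direction*, which is one common choice for distilling from a teacher whose feature scale differs from the student's; an MSE on the projected features would be the stricter alternative.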
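The two-phase strategy from Group 2 amounts to switching the training objective between stages. A minimal sketch, assuming the distillation term is weighted by a coefficient `lam` during pre-training and dropped during downstream fine-tuning (the weighting and the drop are assumptions, not stated in the article):

```python
def staged_objective(action_loss, geo_distill_loss, phase, lam=1.0):
    """Hypothetical staged training objective.

    phase 1: geometric-distillation pre-training (e.g. on Bridge) adds the
             distillation term to the usual action-prediction loss.
    phase 2: downstream fine-tuning (e.g. on LIBERO) optimizes the action
             loss alone, relying on the geometry already distilled.
    """
    if phase == 1:
        return action_loss + lam * geo_distill_loss
    return action_loss

# phase 1 mixes both terms; phase 2 uses only the action loss
pretrain = staged_objective(1.0, 0.5, phase=1)   # -> 1.5
finetune = staged_objective(1.0, 0.5, phase=2)   # -> 1.0
```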
GLaD: Knowledge Distillation Injects 3D Geometric Priors into VLA Models, Pushing Task Success Rates Past 94%
具身智能之心·2025-12-12 01:22