Latest from Tianjin University & Tsinghua! GeoVLA: Enhancing the 3D Feature Extraction of VLA Models, with Marked Robustness Gains (SOTA)

Core Insights
- The article introduces GeoVLA, a novel framework that integrates 3D information into Vision-Language-Action (VLA) models, enhancing robots' spatial perception and adaptability [3][9][10].

Group 1: Background and Motivation
- Advancing robotic manipulation requires intelligent interaction and precise physical control in real-world environments. Recent VLA models have gained attention for their ability to follow instructions and execute actions [7].
- Current VLA models rely primarily on 2D visual inputs and neglect the rich geometric information inherent in the 3D physical world, which limits their spatial perception [8].

Group 2: GeoVLA Framework
- GeoVLA uses a vision-language model (VLM) to process images and language instructions and extract fused vision-language embeddings. In parallel, it converts depth maps into point clouds and uses a custom point embedding network to generate 3D geometric embeddings [3][10][12].
- The framework consists of three key components: a VLM for general understanding, a point embedding network (PEN) for extracting fine-grained 3D features, and a 3D-enhanced action expert (3DAE) for generating action sequences [12][13].

Group 3: Performance Evaluation
- GeoVLA was evaluated on the LIBERO and ManiSkill2 benchmarks and achieved state-of-the-art results on both. It was also robust in real-world tasks that require high adaptability and spatial awareness [15][27].
- On LIBERO, GeoVLA achieved an average success rate of 97.7%, outperforming models such as CogACT (93.2%) and OpenVLA-OFT (95.3%) [27].
- On ManiSkill2, GeoVLA achieved a success rate of 77%, surpassing CogACT (69%) and Dita (66%) [27].

Group 4: Ablation Studies
- Ablation studies indicated that the PEN outperformed traditional point encoders, reaching a success rate of 97.7% compared with 95.8% for an MLP and 95.2% for PointNet [30].
- Using static routing in the mixture-of-experts (MoE) architecture improved performance, which supports the design's effectiveness in leveraging multimodal information [30][20].

Group 5: Real-World Experiments
- Real-world experiments showed GeoVLA's robustness and generalization across a range of 3D manipulation tasks: it maintained high performance despite changes in camera perspective, height, and object size [36][34].
- GeoVLA achieved an average success rate of 86.3% across basic and 3D perception tasks, outperforming other models by significant margins [36].
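
The depth-map-to-point-cloud step mentioned under the GeoVLA framework can be illustrated with standard pinhole back-projection. This is a minimal sketch, not GeoVLA's actual implementation: the function name `depth_to_point_cloud`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth values are all assumptions chosen for illustration, and the article does not say how the authors perform or filter this conversion.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a depth map into camera-frame 3D points.

    Hypothetical helper (not from the article). Assumes a pinhole camera:
    a pixel (u, v) with depth z maps to
        x = (u - cx) * z / fx,  y = (v - cy) * z / fy,  z = z.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row (image y)
        for u, z in enumerate(row):        # u: pixel column (image x)
            if z <= 0:
                continue                   # no valid depth reading here
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map (arbitrary units) with one invalid pixel.
depth = [
    [1.0, 2.0],
    [0.0, 4.0],
]
cloud = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
for point in cloud:
    print(point)
```

In a pipeline like the one described, the resulting points would then be passed to a point encoder such as the PEN; that hand-off is outside the scope of this sketch.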