Group 1
- The article argues that robots operating in open environments need three core capabilities: complex visual scene perception, natural language instruction understanding, and precise action generation [1][3]
- Existing methods face significant bottlenecks, including insufficient generalization, coarse-grained action control, and contradictions between modeling paradigms [3][4]
- The proposed framework introduces a continuous-action discretization strategy that improves the stability of robot inference and enables fine-grained control (a minimal sketch of such a binning step follows these groups) [6][8]
Group 2
- The architecture uses the open-source PaliGemma VLM as a backbone and adds a 300-million-parameter action expert network that generates actions via a diffusion model [6][10]
- Training involves multi-modal observation encoding, action discretization, and Gaussian noise addition to ensure temporal consistency [8][9]
- Inference initializes a noise action sequence, performs multi-step denoising, and applies deterministic de-discretization to produce executable action blocks (see the denoising sketch after these groups) [10][11]
Group 3
- The model achieves state-of-the-art (SOTA) performance across three benchmarks (LIBERO, VLABench, ManiSkill), with an average success rate exceeding the baselines by 10.7% [21]
- On the LIBERO benchmark, the model reaches an average success rate of 96%, demonstrating superior grasping and instruction-following capabilities [21]
- The model also excels at high-precision tasks, reaching an average success rate of 55.2% on the ManiSkill benchmark and significantly outperforming baseline models [24][28]
Group 4
- The article identifies limitations such as insufficient semantic alignment for specific tasks, difficulty with complex coordination tasks, and inadequate modeling of mechanical interactions [32][35]
- Future directions include enhancing cross-modal alignment for semantically rich tasks, designing adaptive task sampling strategies, and integrating physical model priors to improve control precision [35]
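The summary does not give implementation details of the continuous-action discretization strategy. As an illustration only, here is a minimal Python sketch of what a uniform per-dimension binning step and its deterministic inverse could look like; the bin count, action bounds, and the names `discretize`/`dediscretize` are hypothetical and not taken from the article.

```python
import numpy as np

# Hypothetical settings: 256 uniform bins per action dimension, with fixed
# per-dimension bounds. E0's actual discretization scheme is not specified
# in the summary.
NUM_BINS = 256
ACTION_LOW = np.array([-1.0] * 7)   # e.g. 7-DoF end-effector + gripper deltas
ACTION_HIGH = np.array([1.0] * 7)

def discretize(actions: np.ndarray) -> np.ndarray:
    """Map continuous actions in [low, high] to integer bin indices."""
    scaled = (actions - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)   # -> [0, 1]
    idx = np.floor(scaled * NUM_BINS).astype(np.int64)
    return np.clip(idx, 0, NUM_BINS - 1)

def dediscretize(idx: np.ndarray) -> np.ndarray:
    """Deterministically map bin indices back to bin-center continuous actions."""
    centers = (idx.astype(np.float64) + 0.5) / NUM_BINS            # -> (0, 1)
    return ACTION_LOW + centers * (ACTION_HIGH - ACTION_LOW)

# Round-trip example: a continuous action is quantized for the model and
# recovered (up to bin resolution) before being sent to the robot.
a = np.array([0.12, -0.40, 0.88, 0.0, 0.3, -0.7, 1.0])
print(dediscretize(discretize(a)))
```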
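Along the same lines, the inference process described in Group 2 (initialize a Gaussian-noise action sequence, denoise it over several steps, then de-discretize) can be sketched as below. The sampler, step schedule, chunk dimensions, and the `action_expert` stub are all assumptions for illustration, not E0's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

CHUNK_LEN, ACTION_DIM, NUM_STEPS = 16, 7, 10   # hypothetical sizes

def action_expert(noisy_chunk, t, vlm_features):
    """Placeholder for the ~300M-parameter action expert: given the noisy
    action chunk, the diffusion time t, and VLM observation/instruction
    features, predict the clean action chunk. This stub just returns zeros
    so the loop is runnable."""
    return noisy_chunk * 0.0

def generate_action_chunk(vlm_features):
    """Sketch of the inference loop described in the summary."""
    x = rng.standard_normal((CHUNK_LEN, ACTION_DIM))   # 1) start from Gaussian noise
    for step in range(NUM_STEPS):                      # 2) multi-step denoising
        t = 1.0 - step / NUM_STEPS
        x0_pred = action_expert(x, t, vlm_features)
        # Simple deterministic update that interpolates toward the prediction;
        # the actual sampler/schedule used by E0 is not given in the summary.
        x = x + (x0_pred - x) / (NUM_STEPS - step)
    # 3) the real model would then deterministically de-discretize this chunk
    #    (e.g. via something like `dediscretize` above) into executable actions.
    return x

chunk = generate_action_chunk(vlm_features=None)
print(chunk.shape)  # (16, 7): one executable action block per inference call
```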
E0: A New Discrete Diffusion Framework That Greatly Improves VLA Model Generalization and Manipulation Precision
具身智能之心·2025-11-29 02:07