ICCV 2025 | DexVLG: A Large-Scale Dexterous Vision-Language-Grasp Model
自动驾驶之心·2025-07-08 13:13

Core Viewpoint
- The article discusses the development of DexVLG, a large-scale vision-language-grasp model that uses a newly created dataset, DexGraspNet 3.0, to enable robots to perform dexterous grasping from language instructions and single-view RGBD inputs [3][7][9].

Group 1: Motivation and Background
- The rise of large models has enabled robots to handle increasingly complex tasks through vision-language-action systems, but research has largely been limited to simple end-effectors because collecting dexterous-hand data is difficult [3][4].
- DexGraspNet 3.0 is introduced as a large-scale dataset containing 1.7 billion dexterous grasping poses mapped onto 174,000 simulated objects, built to train a vision-language model for functional grasping [5][9].

Group 2: Dataset Overview
- DexGraspNet 3.0 is the largest dataset for dexterous grasping to date, featuring 1.7 billion poses validated in a physics-based simulator, with semantic captions and part-level annotations [9][10] (a hypothetical record layout is sketched after this summary).
- The objects are diverse and sourced from the Objaverse dataset, with part segmentation performed using models such as SAMesh and GPT-4o [11].

Group 3: Model Development
- DexVLG is developed to generate dexterous grasping poses from language instructions and single-view point clouds, using billions of parameters and pre-trained models for feature extraction [7][24].
- The model employs a point cloud encoder and a language foundation model to align visual and linguistic features, from which grasping poses are generated [25][27] (a minimal pipeline sketch follows at the end of this summary).

Group 4: Performance Evaluation
- DexVLG demonstrates strong zero-shot generalization, achieving over a 76% success rate in simulated environments and outperforming baseline models across benchmarks [7][29][31].
- The model's grasping poses are evaluated both for physical quality and for alignment with the language instructions, showing that it can produce high-quality dexterous grasps across different objects and semantic parts [29][31].
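To make the dataset description in Group 2 concrete, below is a minimal sketch of what a single DexGraspNet 3.0-style annotation could contain, based only on what the summary states (a grasp pose, a semantic part, a part-level caption, and physics-simulator validation). The field names, the 22-DoF hand, and the overall schema are assumptions for illustration; the actual dataset format is not specified in the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GraspRecord:
    """One hypothetical annotation in a DexGraspNet 3.0-style dataset.

    Field names and types are illustrative; the real schema is not
    described in the article.
    """
    object_id: str                   # e.g. an Objaverse asset identifier
    part_label: str                  # semantic part the grasp targets, e.g. "handle"
    caption: str                     # part-level language description of the grasp
    wrist_translation: List[float]   # (x, y, z) of the hand root, in meters
    wrist_rotation: List[float]      # unit quaternion (w, x, y, z)
    joint_angles: List[float]        # dexterous-hand joint configuration, in radians
    sim_validated: bool              # True if the grasp held in the physics simulator


# Example: a single made-up record for a mug grasped by its handle.
example = GraspRecord(
    object_id="objaverse/000123-mug",
    part_label="handle",
    caption="grasp the mug by its handle",
    wrist_translation=[0.02, -0.11, 0.15],
    wrist_rotation=[0.71, 0.0, 0.71, 0.0],
    joint_angles=[0.1] * 22,         # 22-DoF hand is an assumption
    sim_validated=True,
)
```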
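The Group 3 description (point cloud encoder plus a language foundation model, aligned to generate grasp poses) can be illustrated with the following minimal sketch. It is not the authors' architecture: the per-point MLP, the projection layers, the transformer fusion, and the direct pose regression head are all placeholders standing in for the large pre-trained components the article mentions, and the 22-joint output dimension is assumed.

```python
import torch
import torch.nn as nn


class DexVLGSketch(nn.Module):
    """Minimal stand-in for the described pipeline: encode a single-view
    point cloud, align it with language features, and regress a grasp pose.
    All module choices and dimensions are placeholders, not DexVLG itself."""

    def __init__(self, d_model: int = 512, n_joints: int = 22):
        super().__init__()
        # Placeholder point cloud encoder (stands in for a large
        # pre-trained point cloud backbone): per-point MLP.
        self.point_encoder = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, d_model)
        )
        # Placeholder projection of instruction embeddings (stands in for
        # features from a pre-trained language foundation model).
        self.text_proj = nn.Linear(768, d_model)
        # Fusion of visual and linguistic tokens.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Grasp head: wrist translation (3) + rotation quaternion (4) + joints.
        self.pose_head = nn.Linear(d_model, 3 + 4 + n_joints)

    def forward(self, points: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) single-view point cloud; text_emb: (B, T, 768)
        point_tokens = self.point_encoder(points)            # (B, N, d_model)
        text_tokens = self.text_proj(text_emb)                # (B, T, d_model)
        tokens = torch.cat([point_tokens, text_tokens], 1)    # joint token sequence
        fused = self.fusion(tokens).mean(dim=1)               # pooled fused features
        return self.pose_head(fused)                          # (B, 3 + 4 + n_joints)


# Smoke test with random tensors standing in for real inputs.
model = DexVLGSketch()
pose = model(torch.randn(2, 1024, 3), torch.randn(2, 12, 768))
print(pose.shape)  # torch.Size([2, 29])
```

The sketch regresses a single pose directly for simplicity; the article does not state how DexVLG actually parameterizes or samples its grasp poses.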