ICCV 2025 | DexVLG: A Large-Scale Dexterous Vision-Language-Grasp Model
具身智能之心· 2025-07-07 09:20
Core Insights
- The article discusses the development of DexVLG, a large-scale vision-language-grasp model designed to enable robots to perform dexterous grasping tasks from language instructions and single-view RGBD inputs [4][8].

Group 1: Motivation and Background
- The rise of large models has driven advances in vision-language-action systems, allowing robots to handle increasingly complex tasks. However, research has largely focused on simple end-effector control because collecting data for dexterous manipulation is difficult [4][5].
- DexVLG is trained on a dataset called DexGraspNet 3.0, which contains 1.7 billion dexterous grasp poses mapped to 174,000 simulated target objects, providing a substantial foundation for training [4][6].

Group 2: Dataset Overview
- DexGraspNet 3.0 is the largest dataset for dexterous grasping, featuring 1.7 billion poses validated in the physics-based simulator IsaacGym, along with semantic titles and part-level annotations [10][11].
- The dataset was constructed with techniques for part perception and semantic understanding, leveraging models such as SAMesh and GPT-4o for part segmentation and title generation [6][12].

Group 3: Model Development
- DexVLG generates dexterous grasp poses from language instructions and single-view point clouds, using a billion-parameter model fine-tuned on the large dataset [8][25].
- The model employs a point cloud encoder and a language foundation model, integrating features from both to predict grasp poses; a hedged architectural sketch follows this summary [26][28].

Group 4: Performance Evaluation
- DexVLG demonstrated superior performance across benchmarks, achieving over a 76% success rate in simulated environments and outperforming baseline models in grasp quality and alignment with language instructions [8][30][32].
- Generalization to unseen objects and semantic parts was a key focus, with metrics defined to assess grasp quality and instruction alignment [30][32].
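
Group 3 describes fusing a point-cloud encoder with a language foundation model to predict grasp poses. The sketch below shows one plausible way such a fusion could be wired up; it is a minimal, hypothetical PyTorch layout under stated assumptions, not the DexVLG implementation. The module names (`PointTokenizer`, `GraspVLM`), all dimensions, the small transformer standing in for the language backbone, and the 22-DoF hand pose parameterization are assumptions for illustration only.

```python
# Hypothetical sketch: point tokens are projected into the language-model
# embedding space, concatenated with instruction tokens, and a pose head
# regresses a grasp (wrist translation + 6D rotation + joint angles).
# None of these names or sizes come from the paper.
import torch
import torch.nn as nn


class PointTokenizer(nn.Module):
    """Toy stand-in for a point-cloud encoder: groups points into fixed tokens."""

    def __init__(self, num_tokens=64, dim=256):
        super().__init__()
        self.num_tokens = num_tokens
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, points):                       # points: (B, N, 3) xyz
        feats = self.mlp(points)                     # (B, N, dim) per-point features
        B, N, D = feats.shape
        # Chunk-and-pool as a placeholder for real sampling/grouping.
        feats = feats[:, : (N // self.num_tokens) * self.num_tokens]
        return feats.reshape(B, self.num_tokens, -1, D).mean(dim=2)  # (B, T_pc, dim)


class GraspVLM(nn.Module):
    """Fuses point tokens with instruction tokens and regresses one grasp pose."""

    def __init__(self, vocab_size=32000, lm_dim=512, pc_dim=256, hand_dof=22):
        super().__init__()
        self.point_encoder = PointTokenizer(dim=pc_dim)
        self.pc_proj = nn.Linear(pc_dim, lm_dim)     # map point tokens into LM space
        self.embed = nn.Embedding(vocab_size, lm_dim)
        layer = nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # LLM stand-in
        # 3 wrist translation + 6D wrist rotation + joint angles.
        self.pose_head = nn.Linear(lm_dim, 3 + 6 + hand_dof)

    def forward(self, points, instruction_ids):
        pc_tokens = self.pc_proj(self.point_encoder(points))        # (B, T_pc, lm_dim)
        txt_tokens = self.embed(instruction_ids)                    # (B, T_txt, lm_dim)
        fused = self.backbone(torch.cat([pc_tokens, txt_tokens], dim=1))
        return self.pose_head(fused.mean(dim=1))                    # (B, 31)


if __name__ == "__main__":
    model = GraspVLM()
    points = torch.randn(2, 2048, 3)                 # single-view point clouds
    instruction = torch.randint(0, 32000, (2, 16))   # tokenized grasp instruction
    print(model(points, instruction).shape)          # torch.Size([2, 31])
```

The design choice illustrated here is the one the summary implies: both modalities are mapped into a single token sequence so the language backbone can attend jointly over geometry and instruction before a lightweight head decodes the hand pose.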