An End-to-End Foundation Model! VCoT-Grasp: A Large Robotic Grasp Detection Model Enhanced with Visual Chain-of-Thought

Core Insights
- The article introduces VCoT-Grasp, an end-to-end language-driven grasp generation model that incorporates Visual Chain-of-Thought reasoning to enhance visual understanding capabilities [10][16].
- A high-quality dataset, VCoT-GraspSet, was created to support the training of the model, consisting of 190K images and 1.36M grasp labels [9][10].

Background and Introduction
- Chain-of-Thought (CoT) is a method that enhances the reasoning ability of large language models through intermediate thinking steps. Visual Chain-of-Thought (VCoT) extends this concept to image modalities [2].
- VCoT-Grasp applies VCoT to robotic grasping tasks to improve the quality of grasping actions [3].

Model Architecture and Innovation
- The VCoT-Grasp model is based on the PaliGemma-3B visual language model and employs a two-stage reasoning process: first predicting the bounding box of the target object and then refining the grasp prediction using the cropped image [7][8].
- The model explicitly distinguishes the target object from the background during the bounding box prediction, allowing for better localization and grasping [7].

Dataset Development
- The VCoT-GraspSet dataset was developed to address the quality issues of existing synthetic grasping datasets, ensuring a high-quality training resource [9][10].

Experimental Results
- VCoT-Grasp demonstrated superior performance in various tests, achieving an average success rate of 83.60% on seen objects and 58.98% on unseen objects when using the LM head [11].
- The model also showed robustness against background changes and distractors, outperforming previous methods in these scenarios [16].

Conclusion
- VCoT-Grasp fills a significant technical gap by validating multi-round processing paradigms in robotic models and exhibits excellent performance in both in-distribution and out-of-distribution scenarios [16].
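
The two-stage reasoning process described under Model Architecture (predict a bounding box, crop, then refine the grasp on the crop) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Grasp` rectangle format, the `predict_box` and `predict_grasp` callables, and the step that maps the grasp from crop coordinates back to the full image are all assumptions made for illustration, and the stub predictors stand in for the actual vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]           # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Grasp:
    """Hypothetical grasp rectangle: center, size, and rotation in degrees."""
    cx: float
    cy: float
    width: float
    height: float
    angle_deg: float


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image inside `box`, clamped to the image bounds."""
    x0, y0, x1, y1 = box
    h, w = len(image), len(image[0])
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(w, x1), min(h, y1)
    return [row[x0:x1] for row in image[y0:y1]]


def two_stage_grasp(
    image: Image,
    instruction: str,
    predict_box: Callable[[Image, str], Box],
    predict_grasp: Callable[[Image, Image, str], Grasp],
) -> Grasp:
    """Stage 1 localizes the target; stage 2 predicts a grasp on the crop."""
    # Stage 1: intermediate "visual thought" = bounding box of the target.
    box = predict_box(image, instruction)
    patch = crop(image, box)

    # Stage 2: refine the grasp using the cropped region (assumed to be
    # expressed in the crop's own coordinate frame).
    local = predict_grasp(image, patch, instruction)

    # Assumption: shift the grasp center back into full-image coordinates.
    x0, y0 = max(0, box[0]), max(0, box[1])
    return Grasp(local.cx + x0, local.cy + y0,
                 local.width, local.height, local.angle_deg)


# Stub predictors (hypothetical) so the sketch runs without a real model.
def fake_box(image: Image, instruction: str) -> Box:
    return (2, 1, 6, 4)


def fake_grasp(image: Image, patch: Image, instruction: str) -> Grasp:
    h, w = len(patch), len(patch[0])
    return Grasp(w / 2, h / 2, w * 0.5, h * 0.5, 0.0)


image = [[0] * 8 for _ in range(6)]
grasp = two_stage_grasp(image, "pick up the mug", fake_box, fake_grasp)
print(grasp)
```

Passing the two predictors in as callables keeps the control flow separate from any particular model, so the same skeleton would work whether both stages are served by a single model or by two.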