Visual Chain-of-Thought
CVPR 2026 Workshop Call for Papers | From Perception to Reasoning: ViSCALE 2.0 Invites You to Reshape Computer Vision's System 2
机器之心 · 2026-02-13 04:19
Core Insights
- The article discusses the evolution of computer vision toward a new paradigm, emphasizing the transition from basic pixel perception to complex spatial reasoning and world modeling, facilitated by test-time scaling (TTS) [2][5]
- The upcoming ViSCALE 2026 conference aims to gather leading scholars to explore breakthroughs in visual models through computational expansion, focusing on deep reasoning rather than mere static outputs [4][5]

Group 1: Conference Highlights
- ViSCALE 2026 will feature discussions on spatial intelligence and world models, with contributions from top scholars including Sergey Levine, Manling Li, and Ziwei Liu [5]
- The conference encourages innovative research submissions that challenge existing visual model limitations, providing a platform for both theoretical and application-focused studies [7]

Group 2: Key Topics of Discussion
- The conference will cover various topics, including:
  - Enhancing the physical consistency and long-term causal reasoning of video generation through TTS [6]
  - Breaking 2D limitations to enable models to navigate and operate in 3D spaces like humans [6]
  - Developing visual reasoning chains that allow models to self-correct and engage in multi-step reasoning [6]
  - Exploring scaling laws that relate computational load during testing to visual reasoning performance [6]

Group 3: Submission Details
- The conference invites submissions in two tracks: Full Papers (8 pages) and Extended Abstracts (up to 4 pages), with specific formatting requirements [9]
- Important deadlines include submission by March 10, 2026, and notification of acceptance by March 18, 2026 [9]
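The workshop's central idea, test-time scaling (trading extra inference compute for better reasoning), can be illustrated with best-of-N sampling, one common TTS recipe. This is a generic sketch, not the method of any work at the workshop; `generate` and `score` are hypothetical stand-ins for a model's sampler and a verifier.

```python
import random

def best_of_n(generate, score, n, rng):
    """Best-of-N sampling, a simple test-time scaling recipe: draw n
    candidate outputs and keep the one a scorer ranks highest. More
    samples means more test-time compute and, ideally, a better answer."""
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

if __name__ == "__main__":
    # Toy stand-ins: the "model" samples integers and the "verifier"
    # prefers values near 50. Both are illustrative placeholders, not
    # components of any system described in the article.
    rng = random.Random(42)
    sample = lambda r: r.randint(0, 100)
    verifier = lambda x: -abs(x - 50)
    print(best_of_n(sample, verifier, n=16, rng=rng))
```

The scaling-law question the workshop raises is, in this toy framing, how the quality of the selected candidate improves as `n` grows.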
An End-to-End Foundation Model! VCoT-Grasp: A Visual Chain-of-Thought-Enhanced Large Model for Robotic Grasp Detection
具身智能之心 · 2025-10-19 13:50
Core Insights
- The article introduces VCoT-Grasp, an end-to-end language-driven grasp generation model that incorporates Visual Chain-of-Thought reasoning to enhance visual understanding [10][16]
- A high-quality dataset, VCoT-GraspSet, was created to support training of the model, consisting of 190K images and 1.36M grasp labels [9][10]

Background and Introduction
- Chain-of-Thought (CoT) enhances the reasoning ability of large language models through intermediate thinking steps; Visual Chain-of-Thought (VCoT) extends this concept to the image modality [2]
- VCoT-Grasp applies VCoT to robotic grasping tasks to improve the quality of grasping actions [3]

Model Architecture and Innovation
- VCoT-Grasp is built on the PaliGemma-3B vision-language model and employs a two-stage reasoning process: first predicting the bounding box of the target object, then refining the grasp prediction using the cropped image [7][8]
- The model explicitly distinguishes the target object from the background during bounding-box prediction, allowing for better localization and grasping [7]

Dataset Development
- VCoT-GraspSet was developed to address the quality issues of existing synthetic grasping datasets, ensuring a high-quality training resource [9][10]

Experimental Results
- VCoT-Grasp demonstrated superior performance in various tests, achieving an average success rate of 83.60% on seen objects and 58.98% on unseen objects when using the LM head [11]
- The model also showed robustness to background changes and distractors, outperforming previous methods in these scenarios [16]

Conclusion
- VCoT-Grasp fills a significant technical gap by validating multi-round processing paradigms in robotic models and performs well in both in-distribution and out-of-distribution scenarios [16]
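The two-stage reasoning described above can be sketched as a plain pipeline: stage one localizes the target with a bounding box, stage two predicts a grasp on the cropped region, and the result is mapped back into full-image coordinates. This is a minimal illustration of the control flow only; `detect_box` and `predict_grasp` are hypothetical stand-ins for the model's two inference passes, not the actual VCoT-Grasp API.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: int
    y: int
    w: int
    h: int

def crop(image, box):
    """Crop a rectangular region from an image stored as nested lists."""
    return [row[box.x:box.x + box.w] for row in image[box.y:box.y + box.h]]

def two_stage_grasp(image, detect_box, predict_grasp):
    """Two-stage visual chain-of-thought as described in the article:
    (1) predict the target's bounding box on the full image, then
    (2) refine the grasp on the cropped region and map the grasp
    center back to full-image coordinates."""
    box = detect_box(image)                          # stage 1: localize target
    gx, gy, angle = predict_grasp(crop(image, box))  # stage 2: refine on crop
    return box.x + gx, box.y + gy, angle
```

Cropping before the second pass is what lets the model attend to the target at higher effective resolution, which is the motivation the article gives for separating the object from the background.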