Visual Chain-of-Thought (Visual CoT)

Zebra-CoT: A Pioneering Visual Chain-of-Thought Dataset Arrives, Lifting Multi-Modal Reasoning Accuracy by 13%
具身智能之心· 2025-07-24 09:53
Core Viewpoint
- The article covers Zebra-CoT, a large-scale, diverse dataset designed to strengthen visual reasoning in multi-modal models, targeting two obstacles: the weak performance of existing visual CoT methods and the shortage of high-quality training data [3][4].

Dataset Construction
- Zebra-CoT contains 182,384 samples, each a logically interleaved text-image reasoning trajectory, spanning four task categories: scientific reasoning, 2D visual reasoning, 3D visual reasoning, and visual logic and strategy games [6][12]. (A sketch of what one such sample might look like follows at the end of this summary.)
- The dataset addresses the limits of prior datasets, which focused on single tasks or lacked a clear reasoning structure, by covering a diverse task mix and ensuring high-quality textual reasoning data [6][18].

Task Coverage
- Scientific reasoning: geometry, physics, chemistry, and algorithm problems [9].
- 2D visual reasoning: visual search and visual puzzles [9].
- 3D visual reasoning: multi-hop object counting and robot planning [9].
- Visual logic and strategy games: chess, checkers, mazes, and more [9].

Data Sources and Processing
- Real-world data is collected from online resources, with careful problem extraction to keep quality high and to preserve the logical links between text and image modalities [10].
- Synthetic data is generated from templates and then expanded with visual language models (VLMs) to improve the diversity and expressiveness of the reasoning text [10]; a generation sketch appears at the end of this summary.

Model Fine-tuning and Performance
- Fine-tuning the Anole-7B model on Zebra-CoT raised accuracy from 4.2% to 16.9%, a 12.7-percentage-point (roughly fourfold) gain, with notable improvements on visual logic benchmarks [14]; a generic fine-tuning sketch also follows below.
- After fine-tuning, the Bagel-7B model could generate high-quality interleaved visual reasoning chains, demonstrating the dataset's effectiveness for developing multi-modal reasoning capabilities [14].

Limitations
- Synthetic data relies on template generation, which can cap the diversity and expressiveness of the reasoning text [18].
- Some sub-tasks have small sample sizes, which may hurt model performance in those areas [18].
- Fine-tuning results vary across tasks; some show insignificant or even decreased performance, indicating a need for further optimization [18].
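To make "interleaved text-image reasoning trajectory" concrete, here is a minimal sketch of how one such sample could be represented. All field names are illustrative assumptions, not the dataset's published schema.

```python
# Hypothetical sketch of one interleaved Zebra-CoT sample; field names
# are assumptions for illustration, not the dataset's actual schema.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class ImageStep:
    path: str     # reference to an intermediate reasoning image
    caption: str  # what this visual step is meant to show

@dataclass
class Sample:
    task_category: str                       # e.g. "2D visual reasoning"
    question: str                            # the problem statement
    trajectory: List[Union[str, ImageStep]]  # interleaved text/image steps
    answer: str                              # final answer

example = Sample(
    task_category="visual logic and strategy games",
    question="White to move: find the mating move.",
    trajectory=[
        "Scan the board for checks, captures, and threats.",
        ImageStep(path="step_1.png", caption="Board after candidate Qh5"),
        "Qh5 delivers checkmate: the king has no escape squares.",
    ],
    answer="Qh5#",
)
```

The interleaving is the key design choice: text steps and image steps alternate in one ordered list, so a model can be trained to emit either modality at each point in the chain.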
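The summary describes synthetic data as template-generated and then diversified with VLMs. A sketch of that pipeline is below; the templates, slot values, and the `vlm_rewrite` callable are all assumptions for illustration.

```python
# Sketch of template-based synthetic generation with optional VLM
# paraphrasing. Everything here (templates, slots, vlm_rewrite) is a
# hypothetical stand-in, not the authors' actual pipeline.
import random

TEMPLATES = [
    "Count the {obj} in the scene, then check the count against the image.",
    "Trace a path through the maze from {start} to {goal}, sketching each turn.",
]

def fill_template(template: str, **slots) -> str:
    """Instantiate one reasoning-text template with concrete slot values."""
    return template.format(**slots)

def diversify(text: str, vlm_rewrite) -> str:
    """Paraphrase templated text with a VLM so wording varies across samples.
    `vlm_rewrite` is any callable wrapping the model of your choice."""
    return vlm_rewrite(f"Rewrite this reasoning step in different words: {text}")

step = fill_template(random.choice(TEMPLATES),
                     obj="red cubes", start="A", goal="B")
print(step)
```

Templates guarantee logical correctness of each step; the VLM rewrite pass targets exactly the limitation the article notes, namely that raw templates cap diversity and expressiveness.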
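Finally, a generic supervised fine-tuning sketch using Hugging Face transformers. The checkpoint id and data file are placeholders, and the real Anole/Bagel training interleaves image tokens, which this text-only sketch omits.

```python
# Generic SFT sketch; "your-org/anole-7b" and "zebra_cot.jsonl" are
# placeholders, not confirmed hub ids or file names.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL = "your-org/anole-7b"  # placeholder checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def tokenize(batch):
    # Serialize question + trajectory + answer into one training string.
    text = [f"{q}\n{t}\n{a}" for q, t, a in
            zip(batch["question"], batch["trajectory"], batch["answer"])]
    return tokenizer(text, truncation=True, max_length=2048)

train_ds = (load_dataset("json", data_files="zebra_cot.jsonl")["train"]
            .map(tokenize, batched=True,
                 remove_columns=["question", "trajectory", "answer"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="anole-zebra-cot",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```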