Core Insights - The article discusses the limitations of current AI models in generating accurate structured images like charts and graphs, despite their success in creating natural images [1][2] - It highlights a significant gap between visual understanding and generation capabilities, which hinders the development of unified multimodal models that can both interpret and create visual content accurately [2][10] Data Layer - A dataset of 1.3 million code-aligned structured samples was created to ensure the accuracy of generated images through precise code definitions [11][13] - The dataset includes executable plotting codes covering six categories, ensuring strict alignment between images and their corresponding codes [14] Model Layer - A lightweight VLM integration solution was designed to balance the capabilities of structured and natural image generation, utilizing FLUX.1 Kontext and Qwen-VL for enhanced understanding of structured image inputs [13][15] - The training process involves a three-stage progressive training approach to maintain the model's ability to generate natural images while improving structured image generation [15][16] Evaluation Layer - The team introduced StructBench and StructScore as specialized benchmarks and metrics to assess the accuracy of generated structured images, addressing the shortcomings of existing evaluation methods [17][19] - StructBench includes 1,714 stratified samples with fine-grained Q&A pairs to validate factual accuracy, while StructScore evaluates model responses against standard answers [19] Performance Comparison - The proposed solution demonstrated significant advantages over existing models, with the best-performing models achieving factual accuracy around 50%, indicating substantial room for improvement in structured visual generation [21][22] - The research emphasizes that high-quality, strictly aligned data is crucial for enhancing model performance, more so than the model architecture itself [22] Broader Implications - This research aims to lay a systematic foundation for structured visual generation, encouraging further exploration in this overlooked area [23][25] - The ultimate goal is to transition AI from being merely a beautification tool to a productivity tool capable of generating accurate mathematical images and experimental charts for various fields [24][25]
告别AI“乱画图表”!港中文团队发布首个结构化图像生成编辑系统
量子位·2025-10-11 09:01