ICLR 2026 | Alibaba's Amap Releases SpatialGenEval: Who Is the Real Master of Text-to-Image?
机器之心 · 2026-02-18 12:51
Core Insights
- The article discusses the limitations of current text-to-image (T2I) models in spatial intelligence, particularly on complex tasks involving spatial perception, reasoning, and interaction [2][4][36]
- A new benchmarking framework, SpatialGenEval, is introduced to evaluate T2I models on ten dimensions of spatial intelligence through dense, long-text prompts [2][3][39]
- The research highlights that existing models excel at generating images but struggle with spatial logic and reasoning, indicating a significant gap in their capabilities [4][36]

Group 1: Spatial Intelligence Evaluation
- SpatialGenEval breaks spatial intelligence down into four main dimensions and ten sub-dimensions, covering 25 real-world application scenarios [3][6]
- The evaluation framework includes 1,230 long, information-dense prompts designed to test aspects of spatial intelligence such as object categories, spatial positions, and interactions [9][39]
- The study reveals that current models' spatial reasoning scores hover around 30%, close to random-guessing levels, indicating a lack of understanding of 3D scene structure [36][37]

Group 2: Model Performance and Improvements
- The strongest open-source model, Qwen-Image, scored 60.6%, nearly matching the top closed-source model, Seedream 4.0, at 62.7%, but both remain below the passing threshold [36][37]
- The research emphasizes the importance of powerful text encoders, noting that models using high-performance large language models (LLMs) significantly outperform those relying solely on CLIP [36][39]
- The study proposes improving model performance by fine-tuning on a new dataset of 15,400 image-text pairs, which yields better spatial evaluation metrics [39][42]
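To make the dimension-level scoring concrete, here is a minimal, hypothetical sketch of how per-dimension pass rates might be aggregated from per-prompt judgments. The dimension names, data, and function are illustrative assumptions, not the paper's actual evaluation code; the point is only that a score near the chance level of the underlying checks (e.g. ~33% on three-way checks) signals random-guessing behavior, as the article notes for spatial reasoning.

```python
# Hypothetical aggregation sketch; dimension names and judgments are
# illustrative and not taken from the SpatialGenEval paper.
from collections import defaultdict

def aggregate_scores(results):
    """results: list of (dimension, passed) pairs, one per prompt check.
    Returns the mean pass rate per dimension as a percentage."""
    totals = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    for dim, passed in results:
        totals[dim][0] += int(passed)
        totals[dim][1] += 1
    return {dim: 100.0 * p / t for dim, (p, t) in totals.items()}

# Illustrative per-prompt judgments for two dimensions.
results = [
    ("spatial_perception", True), ("spatial_perception", True),
    ("spatial_perception", False),
    ("spatial_reasoning", False), ("spatial_reasoning", True),
    ("spatial_reasoning", False),
]
scores = aggregate_scores(results)
# A reasoning score near 33% on three-way checks would sit at the
# random-guessing baseline the article describes.
```

In a real benchmark the boolean judgments would come from a VQA-style grader or human annotators; the aggregation step itself stays this simple.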