Core Viewpoint
- The article presents T2I-R1, a text-to-image generation model from CUHK MMLab that combines a dual-level Chain-of-Thought (CoT) reasoning framework with reinforcement learning to improve image generation quality and alignment with human intent [1][3][11].

Group 1: Methodology
- T2I-R1 employs two distinct levels of CoT reasoning: Semantic-CoT, which reasons about the global structure of the image, and Token-CoT, which handles the detailed generation of image tokens [6][7].
- Semantic-CoT plans and reasons about the image in text before generation begins, improving the alignment between prompts and generated images [7][8].
- Token-CoT generates image tokens sequentially, so that each token remains visually coherent with its neighbors and fine detail is preserved; a minimal pipeline sketch follows this summary [7][8].

Group 2: Model Enhancement
- T2I-R1 enhances a unified language and vision model (ULM) by incorporating both Semantic-CoT and Token-CoT into a single text-to-image generation framework [9][11].
- Reinforcement learning jointly optimizes the two CoT levels: for a single image prompt, the model samples multiple sets of Semantic-CoT and Token-CoT and compares their rewards as a group (see the second sketch below) [11][12].

Group 3: Experimental Results
- T2I-R1 generates images that are more robust and better aligned with human expectations, particularly for unusual prompts [13].
- Quantitatively, T2I-R1 outperforms its baseline by 13% on T2I-CompBench and 19% on the WISE benchmark, surpassing previous state-of-the-art models [16].
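The following is a minimal sketch of the dual-level CoT generation flow described above. The method names on `model` (`generate_text`, `next_image_token`, `decode_tokens`) are illustrative assumptions, not the released T2I-R1 API; the sketch only shows how a Semantic-CoT plan precedes and conditions Token-CoT image-token generation.

```python
# Hypothetical interface (illustrative names, not the paper's code):
#   model.generate_text(prompt)         -> str   (Semantic-CoT plan)
#   model.next_image_token(ctx, toks)   -> int   (one Token-CoT step)
#   model.decode_tokens(toks)           -> image (VQ-style detokenizer)

def generate_image(model, prompt: str, num_image_tokens: int = 1024):
    # Level 1: Semantic-CoT -- reason in text about the global structure
    # of the image before any image token is produced.
    plan = model.generate_text(
        f"Plan the image for: {prompt}. Describe layout, objects, and style."
    )

    # Level 2: Token-CoT -- generate image tokens one by one, conditioned
    # on the prompt, the semantic plan, and all previously generated
    # tokens, so local detail stays coherent with the global plan.
    tokens: list[int] = []
    for _ in range(num_image_tokens):
        tokens.append(model.next_image_token((prompt, plan), tokens))

    # Detokenize the discrete codes into pixels.
    return model.decode_tokens(tokens)
```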
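The second sketch illustrates the group-based RL signal implied by sampling multiple (Semantic-CoT, Token-CoT) rollouts per prompt: rewards are normalized within the group, GRPO-style, and both CoT levels of a rollout share its advantage. The normalization scheme and reward values here are assumptions for illustration, not the paper's exact training code.

```python
import numpy as np

def group_advantages(rewards):
    """Normalize per-rollout rewards against the group mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 rollouts sampled for the same prompt, each scored in [0, 1]
# by a reward ensemble (values are made up for illustration).
rewards = [0.62, 0.71, 0.40, 0.85]
advantages = group_advantages(rewards)

# Rollouts above the group mean receive positive advantage; the Semantic-CoT
# and Token-CoT tokens of a rollout share that advantage, so the two levels
# are pushed jointly toward higher-reward images.
print(advantages)
```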
Text-to-Image Enters Its R1 Moment: CUHK MMLab Releases T2I-R1
机器之心·2025-05-09 02:47