Core Viewpoint
- The article discusses the significant advances in multimodal large models for generating high-fidelity images from complex text prompts, while highlighting the challenges these models still face in accurately interpreting spatial relationships and multi-object attributes [1][2].

Group 1: Introduction of GoT-R1
- A research team from the University of Hong Kong, the Chinese University of Hong Kong, and SenseTime has introduced GoT-R1, an important advancement building on the Generation Chain-of-Thought (GoT) framework [2].
- GoT-R1 strengthens the semantic-spatial reasoning capabilities of multimodal large models through reinforcement learning, allowing the model to autonomously explore and learn better reasoning strategies [3][5].

Group 2: Limitations of the GoT Framework
- The GoT framework improves image generation accuracy and controllability by explicitly planning semantic content and spatial layout before image generation, but its reasoning capability is constrained by supervised fine-tuning data built from predefined templates [4][13].
- GoT-R1 aims to overcome these limitations by introducing reinforcement learning into the semantic-spatial reasoning process, enabling the model to learn and optimize reasoning paths on its own [5][13] (an illustrative training-loop sketch follows this summary).

Group 3: Reward Mechanism in GoT-R1
- GoT-R1 constructs a comprehensive and effective reward mechanism for visual generation tasks, evaluating the generated results along multiple dimensions, including semantic consistency, spatial accuracy, and overall aesthetic quality [13][14].
- The reward framework includes (an illustrative reward sketch follows this summary):
  1. Reasoning Process Evaluation Reward (R_PR), which scores the quality of the reasoning process itself [14]
  2. Reasoning-to-Image Alignment Reward (R_RI), which uses Intersection over Union (IoU) to quantify how faithfully the generated image follows the layout planned in the reasoning chain [15]
  3. Semantic Alignment Reward (R_sem) and Spatial Alignment Reward (R_spa), which assess the completeness and accuracy of the reasoning chain against the original text prompt [16]
  4. Text-to-Image Alignment Reward (R_PI), which evaluates the overall consistency of the generated image with the original text prompt [17]

Group 4: Performance Evaluation of GoT-R1
- GoT-R1 was evaluated on the challenging T2I-CompBench benchmark, where it established new state-of-the-art (SOTA) performance, achieving the highest scores in five of the six evaluation categories [21][23].
- The model showed clear advantages in handling complex, multi-layered instructions, particularly in the "Complex" category [23].
- Compared with the baseline model, GoT-R1-7B achieved up to a 15% improvement on the evaluation metrics, demonstrating the effectiveness of reinforcement learning in enhancing the model's reasoning capabilities [24][25].

Group 5: Comparison of Reasoning Chains
- A comparative analysis using GPT-4o as a judge found that reasoning chains generated by GoT-R1 were preferred over those from the baseline model across all evaluation categories, particularly for spatial relationship understanding [25][26].
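To make the multi-dimensional reward concrete, here is a minimal sketch of how an IoU-based spatial alignment score and an overall reward could be combined. The `Box` type, the index-based object matching, the equal weighting in `total_reward`, and all function names are assumptions for illustration only; they are not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (assumed format)."""
    x1: float
    y1: float
    x2: float
    y2: float

def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union between two boxes; 0.0 if they do not overlap."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def spatial_alignment_reward(planned: list[Box], detected: list[Box]) -> float:
    """Average IoU between boxes planned in the reasoning chain and boxes
    detected in the generated image. Objects are matched by index here;
    a real system would need proper matching."""
    if not planned:
        return 0.0
    return sum(box_iou(p, d) for p, d in zip(planned, detected)) / len(planned)

def total_reward(r_pr: float, r_sem: float, r_spa: float,
                 r_ri: float, r_pi: float) -> float:
    """Combine the per-dimension rewards; equal weighting is an assumption."""
    return (r_pr + r_sem + r_spa + r_ri + r_pi) / 5.0
```

In such a scheme, `r_spa` would come from `spatial_alignment_reward`, while the remaining components would typically be scored by an MLLM judge; the summary above does not specify how the components are weighted.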
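The summary says reinforcement learning is applied to the reasoning stage but does not spell out the training objective. The sketch below shows a group-relative advantage computation in the style of GRPO, a common choice for R1-style training, with a simple REINFORCE-like surrogate loss; treat the choice of algorithm, the group size, and all names as assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of samples drawn for the same prompt.
    rewards: shape (group_size,), one scalar reward per sampled reasoning chain."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def policy_gradient_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Advantage-weighted negative log-probability of each sampled chain.
    logprobs: shape (group_size,), summed token log-probs of each chain."""
    adv = group_relative_advantages(rewards).detach()
    return -(adv * logprobs).mean()

# Usage sketch: for each prompt, sample a group of reasoning chains and images,
# score each with the combined reward above, then take a gradient step.
if __name__ == "__main__":
    logprobs = torch.randn(8, requires_grad=True)  # stand-in for model outputs
    rewards = torch.rand(8)                        # stand-in for combined rewards
    loss = policy_gradient_loss(logprobs, rewards)
    loss.backward()
    print(float(loss))
```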
Let multimodal large models "think it through before drawing"! HKU and collaborators open-source GoT-R1: reinforcement learning unlocks a new paradigm for reasoning in visual generation
机器之心·2025-06-25 06:50