Unified-GRPO Training Strategy
Peking University Finds a Solution to the Multimodal AI "Internal Friction" Problem Identified by Zhang Xiangyu
36Kr · 2025-09-19 10:52
Core Insights
- The main issue in multimodal AI training is the internal conflict between understanding and generation capabilities: improving one often degrades the other [1][5]
- A new framework called UAE has been proposed to address the underlying problem of conflicting training objectives between understanding and generation tasks, replacing separate KPIs with a unified approach [3][5]

Group 1: Challenges in Multimodal AI
- Zhang Xiangyu highlighted that in unified multimodal model training, visual understanding and generation can coexist but rarely collaborate, which leads to internal friction [1]
- Image generation requires intricate spatial planning, physical knowledge, and semantic reasoning, which a Transformer struggles to handle in a single forward pass [1]
- The traditional approach of decoupling understanding and generation has produced models in which the two capabilities coexist without effective collaboration [9]

Group 2: The UAE Framework
- The UAE framework eliminates separate KPIs and establishes a unified pipeline with a single quality-control standard [10]
- The framework draws inspiration from the classic auto-encoder, with the understanding task treated as encoding and the generation task as decoding [11][15]
- The framework aims to make the output image a near-perfect reconstruction of the original input, which aligns the objectives of the understanding and generation modules [17][18]

Group 3: Training Methodology
- UAE introduces a three-phase training strategy called Unified-GRPO, which emphasizes a "left-right loop, two-way reinforcement" approach to strengthen collaboration between the understanding and generation modules [20]
- The first phase establishes basic communication between the two modules, ensuring that the generation module can reconstruct images from the understanding module's outputs [22][23]
- Subsequent phases train each module in turn: the understanding module learns to produce detailed descriptions, and the generation module learns to execute complex instructions based on those descriptions [24][29]

Group 4: Performance Outcomes
- The UAE model generates more detailed and accurate descriptions than other models and achieves higher scores on several evaluation metrics [36][37]
- On the GenEval benchmark, UAE achieved an overall score of 0.86, ranking first among unified models and excelling in particular on tasks that require precise understanding [38]
- The results indicate that with the right objectives and training methods, AI systems can discover more effective strategies for representing and transmitting information [38][39]
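
The reconstruction objective in Group 2 and the reinforcement phases in Group 3 can be illustrated with a small sketch. The summary does not say how Unified-GRPO scores a reconstruction, so everything below is an assumption made for illustration: the reward is taken to be the cosine similarity between an embedding of the original image and an embedding of each reconstructed image, and the advantage is the group-normalized reward that GRPO-style methods commonly use. The function names and the toy embeddings are hypothetical and do not come from the UAE authors.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reconstruction_rewards(original_embedding, reconstructed_embeddings):
    """Assumed reward: similarity of each reconstructed image to the original input."""
    return [cosine_similarity(original_embedding, e) for e in reconstructed_embeddings]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Toy 2-D "embeddings" standing in for real image features (hypothetical values).
    original = [1.0, 0.0]
    candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

    rewards = reconstruction_rewards(original, candidates)
    advantages = group_relative_advantages(rewards)
    for r, a in zip(rewards, advantages):
        print(f"reward={r:.3f} advantage={a:+.3f}")
```

In this sketch, a candidate whose reconstruction is closer to the original than the group average receives a positive advantage, and one that is further away receives a negative advantage. A single reconstruction-based signal of this kind is one way the understanding (encoding) and generation (decoding) modules could be trained against the same objective.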