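Meituan Proposes STAR, a New Unified Multimodal Large Model: GenEval Exceeds 0.91, Breaking the Zero-Sum Deadlock Between Understanding and Generation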

Core Insights
- Meituan has launched a new multimodal model solution called STAR, which achieves breakthroughs in understanding and generation capabilities through its innovative design of "stacked autoregressive architecture + task-progressive training" [2][11]
- STAR has demonstrated state-of-the-art (SOTA) performance in various benchmarks, including GenEval, DPG-Bench, and ImgEdit, making it suitable for industrial applications [2][22]

Industry Pain Points
- The pursuit of unifying "visual understanding" and "image generation" in a single parameter space faces the "curse of capability," characterized by three main contradictions [7]
  - Conflicting optimization goals between semantic alignment and pixel fidelity, leading to a zero-sum game in joint training [8]
  - Complex training paradigms with high costs due to end-to-end training and hybrid architectures [9]
  - Capacity degradation issues, such as catastrophic forgetting and capacity saturation, when introducing new tasks [10]

Core Innovations
- STAR reconstructs the "growth law of capabilities" in multimodal learning, focusing on a system that allows for "capability stacking without conflict" [12][13]
- The core architecture features a stacked isomorphic autoregressive model that simplifies the complexity of capability expansion [14]
- The task-progressive training paradigm breaks down multimodal learning into four progressive stages, ensuring existing capabilities are preserved while new skills are developed [16][18]

Experimental Results
- STAR has shown exceptional performance in generation tasks, achieving a score of 0.91 in GenEval and 87.44 in DPG-Bench, outperforming competitors in various sub-tasks [23][24]
- In editing tasks, STAR-7B scored 4.34 in ImgEdit, demonstrating strong adaptability and precision in responding to various editing commands [26]
- STAR maintains top-tier understanding capabilities across nine authoritative benchmarks, outperforming similar multimodal models [28]

Summary and Outlook
- STAR represents a significant advancement in achieving comprehensive capability unification through a simplified structure, addressing training conflicts, and enhancing performance in understanding, generation, and editing tasks [31]
- Future exploration may include expanding capability boundaries to incorporate more complex multimodal tasks, optimizing efficiency, deepening reasoning capabilities, and upgrading multimodal integration [32]
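
The idea of "capability stacking without conflict" from the Core Innovations section can be illustrated with a toy sketch. This is not STAR's implementation: the summary does not say how existing capabilities are preserved, so the sketch assumes a generic technique, freezing already-trained layers and training only newly stacked layers of the same structure. All names here (`Layer`, `StackedModel`, `add_stage`, `train_step`, `run_demo`) are hypothetical, and the scalar layers stand in for real autoregressive blocks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Layer:
    """Toy scalar layer y = weight * x + bias; every layer shares this structure."""
    weight: float = 1.0
    bias: float = 0.0
    frozen: bool = False

    def forward(self, x: float) -> float:
        return self.weight * x + self.bias


@dataclass
class StackedModel:
    base: List[Layer] = field(default_factory=list)     # existing capability
    stacked: List[Layer] = field(default_factory=list)  # capability added later

    def understand(self, x: float) -> float:
        # The existing task reads the output of the base layers only.
        for layer in self.base:
            x = layer.forward(x)
        return x

    def generate(self, x: float) -> float:
        # The new task runs the base layers, then the stacked layers on top.
        x = self.understand(x)
        for layer in self.stacked:
            x = layer.forward(x)
        return x

    def add_stage(self, n_layers: int) -> None:
        # ASSUMPTION: freeze everything trained so far, then stack fresh layers.
        for layer in self.base + self.stacked:
            layer.frozen = True
        self.stacked.extend(Layer() for _ in range(n_layers))


def train_step(model: StackedModel, x: float, target: float, lr: float = 0.005) -> float:
    """One gradient step on squared error for the new task; frozen layers are skipped."""
    layers = model.base + model.stacked
    inputs = []
    h = x
    for layer in layers:            # forward pass, remembering each layer's input
        inputs.append(h)
        h = layer.forward(h)
    grad = 2.0 * (h - target)       # d(loss)/d(output)
    for layer, inp in zip(reversed(layers), reversed(inputs)):
        grad_in = grad * layer.weight   # use the pre-update weight for backprop
        if not layer.frozen:
            layer.weight -= lr * grad * inp
            layer.bias -= lr * grad
        grad = grad_in
    return (h - target) ** 2        # loss measured before this step's update


def run_demo():
    model = StackedModel(base=[Layer(weight=2.0, bias=1.0)])
    before = model.understand(3.0)          # output of the existing capability
    model.add_stage(n_layers=1)             # start a new training stage
    losses = [train_step(model, x=3.0, target=10.0) for _ in range(50)]
    after = model.understand(3.0)           # unchanged, since the base is frozen
    return before, after, losses[0], losses[-1]
```

Calling `run_demo()` trains only the newly stacked layer on the new task while the base layer's output for the old task stays bit-for-bit identical. Freezing is the simplest way to rule out catastrophic forgetting, at the cost of the new stage being unable to refine earlier parameters; whether STAR makes the same trade-off is not stated in this summary.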