Kunlun Wanwei: Officially Launches and Open-Sources the Multimodal Unified Pretraining Model Skywork UniPic
Zheng Quan Shi Bao Wang · 2025-07-30 03:04
Core Insights
- Kunlun Wanwei officially launched and open-sourced Skywork UniPic, an autoregressive multimodal unified pretraining model that integrates image understanding, text-to-image generation, and image editing within a single model [1][2]
- The model is pretrained end-to-end on large-scale, high-quality data and demonstrates strong generalization and transferability [1]
- Skywork UniPic follows the autoregressive paradigm of GPT-4o, marking the growing maturity of multimodal unified pretraining models in the AI field [1]

Model Architecture
- Traditional multimodal models often rely on VQ or VAE encoders, which capture visual detail at the expense of semantic information and can therefore weaken image understanding [1]
- The Skywork UniPic team adopted the Harmon architecture design and made key adjustments to its representations: a MAR encoder provides the visual representation for image generation, while SigLIP2 serves as the backbone for image understanding (a schematic sketch appears at the end of this article) [1][2]
- This architecture allows generation, understanding, and editing to be trained jointly and to reinforce one another, overcoming the technical bottlenecks of traditional approaches [2]

Efficiency and Design Philosophy
- Skywork UniPic retains the simplicity and efficiency of autoregressive models while achieving deep cross-task collaboration through shared encoders, laying a solid foundation for the practical deployment of multimodal unified models [2]
- At a compact 1.5 billion parameters, the model embodies a "small yet beautiful" design philosophy [2]
- Over the past six months, the company has open-sourced several state-of-the-art models across various fields; Skywork UniPic now joins the "Skywork" open-source family [2]
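To make the dual-encoder, shared-backbone idea from the architecture section concrete, below is a minimal, illustrative PyTorch sketch. It is not Skywork UniPic's actual code: all class names, dimensions, and the toy transformer backbone are assumptions, and autoregressive details such as causal masking and step-by-step decoding are omitted. The sketch only shows the routing pattern the article describes: a SigLIP2-style semantic encoder feeds understanding, a MAR-style encoder feeds generation and editing, and both share one backbone.

```python
# Illustrative sketch only: module names, dimensions, and heads are
# hypothetical and not taken from the Skywork UniPic codebase.
import torch
import torch.nn as nn


class UnderstandingEncoder(nn.Module):
    """Stand-in for a SigLIP2-style semantic encoder (image understanding)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Patchify a 224x224 RGB image into 16x16 patches, then embed.
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, dim) semantic tokens


class GenerationEncoder(nn.Module):
    """Stand-in for a MAR-style encoder producing visual tokens for generation."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)
        return x.flatten(2).transpose(1, 2)  # (B, 196, dim) generation tokens


class UnifiedModel(nn.Module):
    """One shared backbone serving understanding, generation, and editing.

    Both encoders emit tokens in the same embedding space, so a single
    backbone can be trained jointly on all three tasks -- the collaboration
    property the article attributes to the Harmon-style design.
    """

    def __init__(self, dim: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.understand_enc = UnderstandingEncoder(dim)
        self.generate_enc = GenerationEncoder(dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)  # text output (understanding)
        self.pixel_head = nn.Linear(dim, dim)      # visual tokens (generation/editing)

    def forward(self, text_ids: torch.Tensor, images: torch.Tensor, task: str):
        # Route the image through the encoder matching the task, then
        # concatenate with text tokens and run the shared backbone.
        if task == "understand":
            vis = self.understand_enc(images)
        else:  # "generate" and "edit" use the generation-oriented representation
            vis = self.generate_enc(images)
        seq = torch.cat([self.text_embed(text_ids), vis], dim=1)
        h = self.backbone(seq)
        return self.lm_head(h) if task == "understand" else self.pixel_head(h)


if __name__ == "__main__":
    model = UnifiedModel()
    text = torch.randint(0, 1000, (2, 8))
    imgs = torch.randn(2, 3, 224, 224)
    print(model(text, imgs, task="understand").shape)  # (2, 204, 1000)
    print(model(text, imgs, task="generate").shape)    # (2, 204, 256)
```

The routing choice mirrors the trade-off the article raises: a detail-oriented encoder (here the MAR stand-in) suits pixel generation, while a semantics-oriented encoder (the SigLIP2 stand-in) suits understanding, and sharing one backbone is what lets the tasks benefit from joint training.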