ByteDance Has Open-Sourced GPT-4o-Level Image Generation!

Core Viewpoint
- ByteDance has recently made significant advancements in open-source technology by releasing the BAGEL model, which integrates multi-modal capabilities for image generation, editing, and reasoning, positioning itself as a leader in the AI field [1][2][4].

Group 1: Model Features and Capabilities
- The BAGEL model features a unified architecture that combines image reasoning, generation, and editing into a single framework, showcasing its versatility [2][32].
- Despite having only 7 billion active parameters (14 billion total), BAGEL has demonstrated superior performance in image understanding, generation, and editing, rivaling both open-source and closed-source models like Stable Diffusion 3 and GPT-4o [3][41].
- The model supports seamless multi-turn dialogue and complex image editing tasks, including one-click makeup trials and character expression transformations [15][20][25].

Group 2: Technical Architecture
- BAGEL employs a Mixture-of-Transformer-Experts (MoT) architecture, consisting of two Transformer experts focused on multi-modal understanding and generation, respectively [34].
- The model utilizes two independent visual encoders to capture pixel-level and semantic-level features, enhancing its understanding and generation capabilities [34].
- The training process revealed an "emerging properties" phenomenon, where advanced multi-modal reasoning capabilities develop progressively rather than appearing suddenly [36][37].

Group 3: Performance Metrics
- In benchmark tests, BAGEL outperformed existing unified models like Janus-Pro and specialized understanding models, achieving notable scores across various metrics [40][41].
- The model's image editing capabilities are comparable to leading dedicated models, demonstrating its competitive edge in the AI landscape [48][49].
- BAGEL has been made available on Hugging Face under a permissive Apache 2.0 license, facilitating broader access and collaboration [50].
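
To make the two-expert idea from the Technical Architecture group more concrete, here is a minimal toy sketch in plain Python. It is not BAGEL's code. It assumes, for illustration only, that every token attends to every other token in one shared self-attention step and is then handed to exactly one of two expert feed-forward functions according to its role. The names `Token`, `shared_attention`, `make_expert`, `EXPERTS`, and `mot_layer`, the role labels, and the trivial "scale" experts are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Token:
    vec: list   # hidden state as a list of floats
    role: str   # hypothetical label: "understand" or "generate"


def shared_attention(tokens):
    """Single-head self-attention over all tokens, whatever their role.

    Assumption for brevity: query, key, and value projections are the identity.
    """
    dim = len(tokens[0].vec)
    out = []
    for q in tokens:
        # Scaled dot-product scores of this token against every token.
        scores = [sum(a * b for a, b in zip(q.vec, k.vec)) / math.sqrt(dim)
                  for k in tokens]
        # Numerically stable softmax.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of all token vectors.
        mixed = [sum(w / total * k.vec[i] for w, k in zip(weights, tokens))
                 for i in range(dim)]
        out.append(Token(mixed, q.role))
    return out


def make_expert(scale):
    """Stand-in for an expert feed-forward network: elementwise scaling."""
    return lambda vec: [scale * x for x in vec]


# Two experts, one per role (toy weights, chosen so their outputs differ).
EXPERTS = {
    "understand": make_expert(2.0),
    "generate": make_expert(-1.0),
}


def mot_layer(tokens):
    """One toy layer: shared attention, then route each token to its expert."""
    attended = shared_attention(tokens)
    return [Token(EXPERTS[t.role](t.vec), t.role) for t in attended]


tokens = [Token([1.0, 0.0], "understand"), Token([0.0, 1.0], "generate")]
for t in mot_layer(tokens):
    print(t.role, t.vec)
```

In a layout like this, each token passes through only one expert's weights per layer. That is one plausible reading of why the active parameter count (7 billion) is about half of the total (14 billion), though the summary above does not spell out the accounting.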