Image Generation
ByteDance's New Image Generation Model Seedream 5.0 Preview Goes Live, Supporting 2K and 4K Resolution Output
Zhi Tong Cai Jing· 2026-02-10 07:29
Reportedly, Seedream 5.0 is benchmarked against Nano Banana Pro at a lower price, and all users can currently use it free of charge 20 times. The new model emphasizes practical features such as retrieval-based image generation, upgraded prompt understanding, precise adjustment, and 4K enhancement, suggesting an iteration direction geared toward real user needs; in hands-on tests, however, it lacks a leap-level breakthrough compared with Seedance 2.0 from the same family. Seedream 5.0 supports 2K and 4K resolution output: 2K images are generated directly, while 4K is reached through AI enhancement. According to the CapCut website, the 5.0 upgrade introduces retrieval-based image generation for the first time, improves the accuracy of prompt understanding, supports image generation with finer detail and more refined textures, and lets users make precise adjustments to images. On February 10, ByteDance's image generation model Seedream 5.0 Preview went live in the company's video editing app Jianying (CapCut) and its AI creation platform Xiaoyunque, and entered gray-scale testing on the Jimeng AI platform, with image generation free to try for a limited time. ...
ByteDance's Image Generation Model Seedream 5.0 Goes Live
Mei Ri Jing Ji Xin Wen· 2026-02-10 06:21
(Source: Mei Ri Jing Ji Xin Wen) Beijing, February 10 (reporter Li Yutong) — ByteDance's image generation model Seedream 5.0 Preview launched on February 10 in the company's video editing app Jianying (CapCut) and its AI creation platform Xiaoyunque, and began gray-scale testing on the Jimeng AI platform. Logging into Jimeng, Jianying, and other platforms, a Mei Ri Jing Ji Xin Wen reporter found that Seedream 5.0 Preview supports 2K and 4K resolution output, and users can currently try the 2K output for free on the Jimeng platform. ...
Overtaking Nano Banana! OpenAI's Flagship Image Generation Model Goes Live
量子位· 2025-12-17 01:04
Core Viewpoint
- OpenAI has launched its new image generation model, GPT-Image-1.5, which aims to enhance practical usability and compete directly with other leading models in the market [2][13][14].

Summary by Sections

Model Features
- The new model introduces four main highlights: improved instruction adherence, precise editing, better detail retention, and a speed increase of up to four times compared to its predecessor [3][5][14].
- GPT-Image-1.5 is designed to keep key elements such as lighting, composition, and character appearance consistent across input, output, and multi-round editing [15][19].

Performance and Comparisons
- In benchmark tests, GPT-Image-1.5 has been rated first in both text-to-image and image editing categories, surpassing Nano Banana Pro [33].
- The model's instruction adherence rate is reported to be as high as 90%, indicating a significant lead over competitors [35].

Pricing and Accessibility
- The GPT-Image-1.5 API has seen a 20% reduction in input and output costs compared to the previous version [39].
- Pricing varies by resolution, with high-quality images costing approximately $133 per thousand and low-quality images around $9 per thousand [40] (converted to per-image costs below).

Market Positioning
- OpenAI is positioning GPT-Image-1.5 as a productivity tool, with its focus on fine editing capabilities and reduced pricing indicating a strategic shift toward enhancing practical applications [41].
- The model is now available to all ChatGPT users and API users globally, marking a significant step in OpenAI's product offerings [38].
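For scale, the reported per-thousand prices convert to per-image costs as follows; this assumes linear pricing, since the summary does not quote OpenAI's exact rate card:

```python
# Per-image cost derived from the reported per-thousand prices (assumed linear).
high_quality_per_image = 133 / 1000  # ≈ $0.133 per high-quality image
low_quality_per_image = 9 / 1000     # ≈ $0.009 per low-quality image
print(f"high: ${high_quality_per_image:.3f}, low: ${low_quality_per_image:.3f}")
```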
Just In: OpenAI Launches the All-New ChatGPT Images, with Altman Flexing His Abs for the Promo
36Ke· 2025-12-17 01:04
Core Insights
- OpenAI has launched a new version of ChatGPT Images, which enhances image generation and editing capabilities, aiming to simplify the user experience and broaden accessibility [25][58].

Group 1: Product Features
- The new ChatGPT Images is powered by OpenAI's flagship image generation model, allowing users to create and edit images with precision while maintaining key details [25].
- The model supports various editing functions, including adding, removing, combining, and replacing elements in images [26].
- It features a transformation capability that allows users to change and add elements while preserving important details, making it suitable for both simple and complex concepts [37].

Group 2: User Experience Enhancements
- OpenAI has introduced a new "Images" feature in ChatGPT, designed to make image generation more enjoyable and effortless, with numerous preset filters and prompts to inspire creativity [56].
- The new model is accessible through the mobile apps and chatgpt.com, streamlining the image exploration process [56].
- The price for image input and output has been reduced by 20% compared to the previous version, enabling users to generate and iterate on more images within the same budget [58].

Group 3: Competitive Landscape
- The launch of ChatGPT Images signals a shift in competition from pure model capabilities to a comprehensive product experience [62].
- OpenAI's strategy includes lowering the psychological barrier for users by introducing an independent "Images" entry point and preset style filters, making image generation as simple as tweeting [62].
Just In: OpenAI Launches the All-New ChatGPT Images, with Altman Flexing His Abs for the Promo
机器之心· 2025-12-17 00:00
Core Viewpoint
- OpenAI has launched a new version of ChatGPT Images, enhancing image generation and editing capabilities, aiming to simplify user interaction and broaden accessibility in creative processes [10][34][44].

Group 1: New Features and Improvements
- The new ChatGPT Images is powered by OpenAI's flagship image generation model, offering precise editing while maintaining key details, with a fourfold increase in image generation speed [10][11].
- The model excels in various editing types, including adding, removing, combining, and replacing elements, allowing detailed transformations while preserving important aspects of the original image [12][15].
- Enhanced instruction adherence enables the model to follow user commands more reliably, producing more accurate edits and better handling of complex compositions [24].

Group 2: User Experience and Accessibility
- The updated Images feature is designed to make image generation more enjoyable and effortless, with numerous preset filters and prompts to inspire creativity [34][44].
- The new model is available to all ChatGPT users and offers a 20% reduction in image input and output costs compared to the previous version, allowing more image generation within the same budget [37].
- OpenAI aims to lower the psychological barrier for users by introducing an independent "Images" entry point and simplifying the interaction, making it as easy as posting on social media [44].

Group 3: Competitive Landscape
- The release of ChatGPT Images signals a shift in the competitive landscape of AI image generation, from a focus on model capabilities to a comprehensive product experience [43].
- OpenAI has not released quantitative benchmark results for this update, indicating a strategic emphasis on user experience rather than purely technical performance metrics [43].
Doubao Image Creation Model Seedream 4.5 Released
Mei Ri Jing Ji Xin Wen· 2025-12-03 11:51
Core Viewpoint
- Huoshan Engine officially launched the Doubao-Seedream-4.5 image creation model on December 3; it is now available for public testing and shows significant improvements across multiple aspects of image generation quality and stability [1]

Group 1
- The new model demonstrates advancements in subject consistency, instruction adherence accuracy, spatial logic understanding, and aesthetic expressiveness [1]
- The overall quality and stability of image generation have been further enhanced with this iteration of the model [1]
NeurIPS 2025 Oral | One Token at Zero Cost: REG Makes Diffusion Training Converge 20x Faster!
机器之心· 2025-11-29 01:49
Core Insights
- REG is a simple and effective method that significantly accelerates the training convergence of generative models by introducing a class token, enhancing the performance of diffusion models [2][9][38]

Group 1: Methodology
- REG combines low-level latent representations with high-level class tokens from pre-trained visual models, allowing simultaneous noise addition and denoising optimization during training [9][14] (see the sketch after this summary)
- The training process requires only the addition of one token, resulting in a computational overhead of less than 0.5% while not increasing inference costs [9][10][26]
- REG achieves a 63x and 23x acceleration in convergence speed compared to SiT-XL/2 and SiT-XL/2+REPA, respectively, on ImageNet 256×256 [10][17]

Group 2: Performance Metrics
- In terms of FID scores, REG outperforms REPA significantly, achieving an FID of 1.8 after 4 million training steps, while SiT-XL/2+REPA achieves an FID of 5.9 [17][19]
- REG reduces training time by 97.90% compared to SiT-XL/2 while maintaining similar FID scores [24][25]
- The inference overhead of REG is minimal, with increases in parameters, FLOPs, and latency all below 0.5%, while FID scores improve by 56.46% compared to SiT-XL/2+REPA [26][27]

Group 3: Ablation Studies
- Extensive ablation studies demonstrate the effectiveness of REG, showing that high-level global discriminative information significantly enhances generation quality [28][30]
- Introducing the DINOv2 class token yields the best image generation quality, indicating the importance of high-level semantic guidance [30][31]

Group 4: Conclusion
- Overall, REG represents a highly efficient training paradigm that entangles high-level and low-level tokens, promoting an "understanding-generating" decoupling in generative models and delivering superior generation outcomes without increasing inference costs [38]
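To make the mechanism concrete, here is a minimal sketch of REG's core idea as described above: a class token from a frozen pre-trained encoder (e.g. DINOv2) is noised together with the image latents, appended as one extra token, and denoised jointly. This is an illustration inferred from the summary, not the authors' implementation; the backbone stand-in, shapes, and the `lambda_cls` weighting are assumptions.

```python
# Illustrative sketch of REG, inferred from the summary above; not the
# authors' code. Module names and shapes are assumptions.
import torch
import torch.nn as nn

class REGBackbone(nn.Module):
    def __init__(self, dim: int = 768, depth: int = 4, heads: int = 12):
        super().__init__()
        # Stand-in for the diffusion transformer (e.g. SiT) backbone.
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latents, noisy_cls, t_emb):
        # noisy_latents: (B, N, D) noised image latents
        # noisy_cls:     (B, 1, D) noised class token from a frozen encoder
        #                (e.g. DINOv2), sharing the latents' noise schedule
        # t_emb:         (B, 1, D) timestep embedding, broadcast over tokens
        x = torch.cat([noisy_latents, noisy_cls], dim=1) + t_emb
        return self.out(self.blocks(x))  # noise prediction for all N+1 tokens

def reg_loss(pred, eps_latents, eps_cls, lambda_cls: float = 1.0):
    # Joint denoising objective: the single appended token is the entire
    # training overhead (<0.5% compute) and is dropped at inference time.
    pred_lat, pred_cls = pred[:, :-1], pred[:, -1:]
    loss_img = torch.mean((pred_lat - eps_latents) ** 2)
    loss_cls = torch.mean((pred_cls - eps_cls) ** 2)
    return loss_img + lambda_cls * loss_cls
```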
The Ultimate Form of RAE? Peking University and Alibaba Propose UniLIP: Extending CLIP to Reconstruction, Generation, and Editing
机器之心· 2025-11-02 08:01
Core Insights
- The article discusses the UniLIP model, which addresses the trade-off between semantic understanding and pixel-detail retention in unified multimodal models [2][4][32]
- UniLIP achieves state-of-the-art (SOTA) performance across various benchmarks while maintaining or slightly improving understanding capabilities compared to larger models [5][26]

Methodology
- UniLIP employs a two-stage training framework with a self-distillation loss to add image reconstruction capability without sacrificing the original understanding performance [4][11] (see the code sketch after this summary)
- The first stage freezes the CLIP model and aligns only the decoder, which learns to reconstruct images from fixed CLIP features [9][11]
- The second stage jointly trains CLIP and applies self-distillation to keep features consistent while injecting pixel details [11][12]

Performance Metrics
- UniLIP models (1B and 3B parameters) achieved SOTA results on benchmarks such as GenEval (0.90), WISE (0.63), and ImgEdit (3.94) [5][26][27]
- In image reconstruction, UniLIP outperformed previous quantization methods and demonstrated significant advantages in generation efficiency [22][24]

Architectural Design
- The architecture integrates InternVL3 and SANA, using InternViT as the CLIP encoder and a pixel decoder from DC-AE [20]
- The model uses a connector structure that stays consistent with large language models (LLMs) [20]

Training Data
- UniLIP's training data comprises 38 million pre-training samples and 60,000 instruction fine-tuning samples for generation, along with 1.5 million editing samples [21]

Image Generation and Editing
- UniLIP excels at both image generation and editing, achieving high benchmark scores thanks to its rich feature representation and precise semantic alignment [26][27][30]
- The dual-condition architecture effectively connects the MLLM with diffusion models, ensuring high fidelity and consistency in generated and edited images [18][32]
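As a concrete reading of the two-stage recipe above, here is a hedged sketch of the training objectives: in stage 2, while the CLIP encoder is fine-tuned for reconstruction, its features are anchored to a frozen copy of the original encoder via self-distillation. This is inferred from the summary, not UniLIP's released code; the MSE forms and loss weights are assumptions.

```python
# Illustrative sketch of UniLIP's two-stage training, inferred from the
# summary above; not the released implementation.
import copy
import torch
import torch.nn.functional as F

def make_teacher(clip_encoder):
    # Freeze a copy of the original CLIP encoder (e.g. InternViT) to serve
    # as the self-distillation target that preserves its semantic space.
    teacher = copy.deepcopy(clip_encoder).eval()
    teacher.requires_grad_(False)
    return teacher

def stage1_loss(clip_encoder, decoder, images):
    # Stage 1: CLIP is frozen; only the pixel decoder (e.g. from DC-AE)
    # learns to reconstruct images from fixed CLIP features.
    with torch.no_grad():
        feats = clip_encoder(images)
    return F.mse_loss(decoder(feats), images)

def stage2_loss(clip_encoder, teacher, decoder, images,
                w_rec: float = 1.0, w_distill: float = 1.0):
    # Stage 2: CLIP is trained jointly; self-distillation keeps its features
    # close to the teacher's while reconstruction injects pixel detail.
    feats = clip_encoder(images)
    with torch.no_grad():
        target = teacher(images)
    loss_rec = F.mse_loss(decoder(feats), images)
    loss_distill = F.mse_loss(feats, target)  # assumed form of the constraint
    return w_rec * loss_rec + w_distill * loss_distill
```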
Another Blow to VAE! Tsinghua and Kuaishou Unveil the SVG Diffusion Model: 6200% More Training Efficiency, 3500% Faster Generation
量子位· 2025-10-28 05:12
Core Viewpoint
- The article discusses the move away from Variational Autoencoders (VAE) toward new models such as SVG, developed by Tsinghua University and Kuaishou, highlighting significant gains in training efficiency and generation speed and addressing VAE's semantic entanglement problem [1][4][10]

Group 1: VAE Limitations and New Approaches
- VAE is being abandoned because of its semantic entanglement issue, where adjusting one feature affects others and complicates the generation process [4][8]
- The SVG model achieves a 62-fold improvement in training efficiency and a 35-fold increase in generation speed compared to traditional methods [3][10]
- The RAE approach focuses solely on improving generation performance by reusing pre-trained encoders, whereas SVG aims for multi-task versatility by constructing a feature space that integrates semantics and details [11][12]

Group 2: SVG Model Details
- SVG uses the pre-trained DINOv3 model for semantic extraction, effectively distinguishing features of different categories such as cats and dogs and thereby resolving semantic entanglement [14] (see the sketch after this summary)
- A lightweight residual encoder captures the high-frequency details that DINOv3 may overlook, ensuring a comprehensive feature representation [14]
- A distribution alignment mechanism is crucial for preserving the semantic structure while integrating detail features; removing it causes a significant increase in FID [15][16]

Group 3: Performance Metrics
- In experiments, SVG outperformed traditional VAE-based models on several metrics, achieving an FID of 6.57 on ImageNet after 80 epochs, versus 22.58 for the VAE-based SiT-XL [18]
- Efficiency improves further with training: the FID drops to 1.92 after 1400 epochs, approaching top-tier generative models [18]
- SVG's feature space is versatile, applying directly to tasks such as image classification and semantic segmentation without fine-tuning, and reaching 81.8% Top-1 accuracy on ImageNet-1K [22]
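The following is a minimal sketch of the SVG feature construction as described: frozen DINOv3 semantics fused with a lightweight residual encoder's high-frequency details, plus a distribution-alignment term that keeps the fused features close to the semantic distribution. All module names, the residual encoder design, and the moment-matching form of the alignment loss are assumptions inferred from the summary, not the authors' code.

```python
# Illustrative sketch of SVG's semantics-plus-details feature space; an
# assumption-laden reading of the summary, not the released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVGFeatures(nn.Module):
    def __init__(self, dinov3, dim: int = 768, detail_dim: int = 64):
        super().__init__()
        self.dinov3 = dinov3.eval()            # frozen semantic extractor
        self.dinov3.requires_grad_(False)
        # Lightweight residual encoder for high-frequency details (texture,
        # fine edges) that the semantic encoder may discard.
        self.detail_enc = nn.Sequential(
            nn.Conv2d(3, detail_dim, 3, stride=16, padding=1),
            nn.GELU(),
            nn.Conv2d(detail_dim, detail_dim, 1),
        )
        self.proj = nn.Linear(dim + detail_dim, dim)

    def forward(self, images):
        with torch.no_grad():
            sem = self.dinov3(images)          # (B, N, dim) patch semantics
        det = self.detail_enc(images)          # (B, detail_dim, H/16, W/16)
        det = det.flatten(2).transpose(1, 2)   # (B, N, detail_dim)
        fused = self.proj(torch.cat([sem, det], dim=-1))
        return fused, sem

def alignment_loss(fused, sem):
    # Distribution alignment: match the first two moments so that detail
    # injection does not distort the frozen semantic structure.
    return F.mse_loss(fused.mean(0), sem.mean(0)) + \
           F.mse_loss(fused.std(0), sem.std(0))
```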
New Work from Saining Xie: VAE Retires, RAE Takes Over
量子位· 2025-10-14 08:16
Core Viewpoint
- The era of Variational Autoencoders (VAE) is coming to an end, with Representation Autoencoders (RAE) set to take over in the field of diffusion models [1][3]

Summary by Sections

RAE Introduction
- RAE is a new type of autoencoder for training diffusion Transformers (DiT): a pre-trained representation encoder (such as DINO, SigLIP, or MAE) paired with a lightweight decoder replaces the traditional VAE [3][9] (sketched in code after this summary)

Advantages of RAE
- RAE provides high-quality reconstructions and a semantically rich latent space, supports scalable transformer-based architectures, and converges faster without additional representation alignment losses [4][10]

Performance Metrics
- At 256×256 resolution, the FID is 1.51 without guidance; with guidance it is 1.13 at both 256×256 and 512×512 [6]

Limitations of VAE
- VAE's outdated backbone networks lead to overly complex architectures, requiring 450 GFLOPs where a simple ViT-B encoder needs only 22 GFLOPs [7]
- VAE's compressed latent space (only 4 channels) severely limits its information-carrying capacity [7]
- VAE's weak representation capability, from relying solely on reconstruction training, yields low feature quality, slows convergence, and hurts generation quality [7]

RAE's Design and Training
- RAE combines pre-trained representation encoders with trained decoders, requiring no extra training or alignment phases and introducing no auxiliary loss functions [9]
- Despite its simplicity, RAE outperforms SD-VAE in reconstruction quality [10]

Model Comparisons
- RAE variants such as DINOv2-B, SigLIP2-B, and MAE-B show significant improvements in rFID and Top-1 accuracy over SD-VAE [11]

Adjustments for Diffusion Models
- RAE needs only simple adjustments to work well in high-dimensional latent spaces: a wide DiT design, adapted noise scheduling, and noise injection during decoder training [13][17]
- A DiT-XL trained on RAE surpasses REPA without any auxiliary losses or extra training phases, converging up to 16 times faster than SD-VAE-based REPA [18][19]

Scalability and Efficiency
- The new architecture improves DiT's scalability in training compute and model size, outperforming both standard RAE-based DiT and traditional VAE-based methods [24]
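To ground the recipe above, here is a hedged sketch of a Representation Autoencoder: a frozen pre-trained encoder (DINOv2 / SigLIP / MAE) supplies the latent tokens, and only a lightweight ViT-style decoder is trained, with a plain reconstruction loss and no auxiliary objectives. The decoder architecture, dimensions, and unpatchify step are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch of a Representation Autoencoder (RAE), based on the
# summary above; decoder details are assumptions, not the released code.
import torch
import torch.nn as nn

class RAE(nn.Module):
    def __init__(self, encoder, dim: int = 768, patch: int = 16, depth: int = 4):
        super().__init__()
        self.encoder = encoder.eval()          # frozen DINOv2 / SigLIP / MAE
        self.encoder.requires_grad_(False)
        # Lightweight transformer decoder trained purely for reconstruction.
        layer = nn.TransformerEncoderLayer(dim, 12, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.Linear(dim, 3 * patch * patch)
        self.patch = patch

    def forward(self, images):
        B, _, H, W = images.shape
        with torch.no_grad():
            z = self.encoder(images)           # (B, N, dim) latent tokens
        x = self.to_pixels(self.decoder(z))    # (B, N, 3*p*p)
        # Unpatchify the per-token pixel patches back into an image.
        p, h, w = self.patch, H // self.patch, W // self.patch
        x = x.view(B, h, w, 3, p, p).permute(0, 3, 1, 4, 2, 5)
        return x.reshape(B, 3, H, W)

# Training touches only the decoder, with a plain reconstruction loss:
# loss = torch.nn.functional.mse_loss(rae(images), images)
```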