Image Generation
Overtaking Nano Banana! OpenAI's Flagship Image Generation Model Goes Live
量子位· 2025-12-17 01:04
Jay, reporting from Aofeisi for 量子位 | 公众号 QbitAI. OpenAI's "code red" is still pushing hard: after holding it back for the better part of a year, the company has finally released its image generation model, GPT-Image-1.5. According to OpenAI, the update has four main highlights: stricter instruction following; precise editing; detail preservation; and generation roughly 4x faster than before.

It feels like a head-on answer to Nano Banana, and the current use cases look very similar: editing prompts such as "change the car's color to orange", "rescue this burnt flatbread", or "make a retro-style restaurant ad from elements like a milkshake rack and a cheeseburger". On instruction adherence and precise editing, it is indeed noticeably stronger than before. It is also usable today: GPT-Image-1.5 is rolling out to all users in ChatGPT and is offered in the API as GPT Image 1.5 (a minimal API sketch follows this excerpt).

One of the showcase prompts: "Shoot a photorealistic scene of 1970s Chelsea, London, with everything in sharp focus, dense crowds, and a bus carrying an 'ImageGen 1.5' ad printed with the OpenAI logo and the tagline 'Create what you imagine'. The overall style should look like amateur photography, iPhone-snapshot quality…" OpenAI's strongest image generation model: after a round of head-on "grilling" from Google, the GPT-Image-1.5 that OpenAI had kept under wraps for most of a year could be held back no longer. This flagship image generation model, ...
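The article notes that the model is exposed in the API under the name GPT Image 1.5. Below is a minimal sketch of what a call through the OpenAI Python SDK's standard Images endpoint could look like; the model string "gpt-image-1.5", the prompt, and the output filename are assumptions based on the article's naming, not confirmed API values.

```python
# Hypothetical sketch: generating an image via the OpenAI Images API.
# The model identifier "gpt-image-1.5" is assumed from the article's naming;
# verify it against the official model list before use.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-1.5",  # assumed identifier, per the article
    prompt="A retro-style restaurant ad built from a milkshake rack and a cheeseburger.",
    size="1024x1024",
)

# The Images API returns base64-encoded image data for GPT image models.
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("retro_restaurant_ad.png", "wb") as f:
    f.write(image_bytes)
```

The same endpoint family covers the iterative editing workflow described in the article, with `images.edit` accepting a reference image alongside the instruction prompt.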
Just In: OpenAI Launches the All-New ChatGPT Images, and Altman Shows Off His Abs to Promote It
36Kr · 2025-12-17 01:04
Core Insights
- OpenAI has launched a new version of ChatGPT Images, which enhances image generation and editing capabilities, aiming to simplify the user experience and broaden accessibility [25][58].

Group 1: Product Features
- The new ChatGPT Images is powered by OpenAI's flagship image generation model, allowing users to create and edit images with precision while maintaining key details [25].
- The model supports various editing functions, including adding, removing, combining, and replacing elements in images [26].
- It features a transformation capability that allows users to change and add elements while preserving important details, making it suitable for both simple and complex concepts [37].

Group 2: User Experience Enhancements
- OpenAI has introduced a new "Images" feature in ChatGPT, designed to make the image generation experience more enjoyable and effortless, with numerous preset filters and prompts to inspire creativity [56].
- The new model is accessible through mobile applications and chatgpt.com, streamlining the image exploration process [56].
- The price of image input and output has been reduced by 20% compared to the previous version, enabling users to generate and iterate on more images within the same budget [58].

Group 3: Competitive Landscape
- The launch of ChatGPT Images signifies a shift in competition from pure model capabilities to a comprehensive product experience [62].
- OpenAI's strategy includes lowering psychological barriers for users by introducing an independent "Images" entry point and preset style filters, making image generation as simple as tweeting [62].
Just In: OpenAI Launches the All-New ChatGPT Images, and Altman Shows Off His Abs to Promote It
机器之心· 2025-12-17 00:00
Core Viewpoint
- OpenAI has launched a new version of ChatGPT Images, enhancing image generation and editing capabilities, aiming to simplify user interaction and broaden accessibility in creative processes [10][34][44].

Group 1: New Features and Improvements
- The new ChatGPT Images is powered by OpenAI's flagship image generation model, offering precise editing while maintaining key details, with a fourfold increase in image generation speed [10][11].
- The model excels in various editing types, including adding, removing, combining, and replacing elements, allowing for detailed transformations while preserving important aspects of the original image [12][15].
- Enhanced instruction adherence enables the model to follow user commands more reliably, resulting in more accurate edits and better handling of complex compositions [24].

Group 2: User Experience and Accessibility
- The updated Images feature is designed to make the image generation experience more enjoyable and effortless, with numerous preset filters and prompts to inspire creativity [34][44].
- The new model is available to all ChatGPT users and offers a 20% reduction in image input and output costs compared to the previous version, allowing for more image generation within the same budget [37].
- OpenAI aims to lower the psychological barrier for users by introducing an independent "Images" entry point and simplifying the interaction process, making it as easy as posting on social media [44].

Group 3: Competitive Landscape
- The release of ChatGPT Images signifies a shift in the competitive landscape of AI image generation, moving from a focus on model capabilities to a comprehensive product experience [43].
- OpenAI has not released quantitative benchmark results for this update, indicating a strategic emphasis on user experience rather than purely technical performance metrics [43].
Doubao Image Creation Model Seedream 4.5 Released
Mei Ri Jing Ji Xin Wen· 2025-12-03 11:51
每经 AI flash: according to Volcano Engine's official WeChat account, on December 3 Volcano Engine officially released the Doubao image creation model Doubao-Seedream-4.5, which is now open for public beta. The new-generation model iterates on subject consistency, instruction-following precision, spatial-logic understanding, and aesthetic expressiveness, further improving the overall quality and stability of image generation. ...
NeurIPS 2025 Oral | One Token at Zero Cost: REG Makes Diffusion Training Converge 20x Faster!
机器之心· 2025-11-29 01:49
Core Insights
- REG is a simple and effective method that significantly accelerates the training convergence of generative models by introducing a class token, enhancing the performance of diffusion models [2][9][38].

Group 1: Methodology
- REG combines low-level latent representations with high-level class tokens from pre-trained visual models, allowing for simultaneous noise addition and denoising optimization during training [9][14] (a minimal sketch of this idea follows the summary).
- The training process requires only the addition of one token, resulting in a computational overhead of less than 0.5% while not increasing inference costs [9][10][26].
- REG achieves a 63x and 23x acceleration in convergence speed compared to SiT-XL/2 and SiT-XL/2+REPA, respectively, on ImageNet 256×256 [10][17].

Group 2: Performance Metrics
- In terms of FID scores, REG outperforms REPA significantly, achieving a FID of 1.8 after 4 million training steps, while SiT-XL/2+REPA achieves a FID of 5.9 [17][19].
- REG shows a reduction in training time by 97.90% compared to SiT-XL/2 while maintaining similar FID scores [24][25].
- The inference overhead for REG is minimal, with increases in parameters, FLOPs, and latency being less than 0.5%, while FID scores improve by 56.46% compared to SiT-XL/2 + REPA [26][27].

Group 3: Ablation Studies
- Extensive ablation studies demonstrate the effectiveness of REG, showing that high-level global discriminative information significantly enhances generation quality [28][30].
- The introduction of the DINOv2 class token leads to the best performance in generating quality images, indicating the importance of high-level semantic guidance [30][31].

Group 4: Conclusion
- Overall, REG represents a highly efficient training paradigm that integrates high-level and low-level token entanglement, promoting an "understanding-generating" decoupling in generative models, leading to superior generation outcomes without increasing inference costs [38].
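To make the mechanism in Group 1 concrete, here is a minimal, hypothetical PyTorch sketch of the core idea: the class token of the clean image, taken from a frozen pretrained encoder such as DINOv2, is appended to the image latents, noised together with them, and denoised jointly by the diffusion transformer. All module names (dit, dinov2, vae_encode, proj) are placeholders, and the flow-matching formulation is an assumption, not the authors' code.

```python
# Hypothetical REG-style training step: one extra class token is noised and
# denoised jointly with the low-level latents (placeholder modules throughout).
import torch
import torch.nn.functional as F

def reg_training_step(x_image, dit, dinov2, vae_encode, proj):
    z = vae_encode(x_image)                      # low-level latents, (B, N, D)
    with torch.no_grad():
        cls_tok = dinov2(x_image).unsqueeze(1)   # frozen class token, (B, 1, D_enc)
    cls_tok = proj(cls_tok)                      # project to the latent width D

    tokens = torch.cat([z, cls_tok], dim=1)      # only +1 token of overhead
    t = torch.rand(tokens.size(0), 1, 1, device=tokens.device)
    noise = torch.randn_like(tokens)
    noisy = (1 - t) * tokens + t * noise         # flow-matching interpolation
    target = noise - tokens                      # velocity target

    pred = dit(noisy, t.flatten())               # denoise latents and class token together
    return F.mse_loss(pred, target)
```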
The Ultimate Form of RAE? Peking University & Alibaba Propose UniLIP: Extending CLIP to Reconstruction, Generation, and Editing
机器之心· 2025-11-02 08:01
Core Insights
- The article discusses the innovative UniLIP model, which addresses the trade-off between semantic understanding and pixel-detail retention in unified multimodal models [2][4][32].
- UniLIP achieves state-of-the-art (SOTA) performance on various benchmarks while maintaining or slightly improving understanding capabilities compared to larger models [5][26].

Methodology
- UniLIP employs a two-stage training framework with a self-distillation loss to enhance image reconstruction capabilities without sacrificing original understanding performance [4][11] (sketched below).
- The first stage involves aligning the decoder while freezing the CLIP model, focusing on learning to reconstruct images from fixed CLIP features [9][11].
- The second stage jointly trains CLIP and applies self-distillation to ensure feature consistency while injecting pixel details [11][12].

Performance Metrics
- UniLIP models (1B and 3B parameters) achieved SOTA results on benchmarks such as GenEval (0.90), WISE (0.63), and ImgEdit (3.94) [5][26][27].
- In image reconstruction, UniLIP outperformed previous quantization methods and demonstrated significant advantages in generation efficiency [22][24].

Architectural Design
- The architecture of UniLIP integrates InternVL3 and SANA, utilizing InternViT as the CLIP encoder and a pixel decoder from DC-AE [20].
- The model is designed with a connector structure that maintains consistency with large language models (LLMs) [20].

Training Data
- UniLIP's training data includes 38 million pre-training samples and 60,000 instruction fine-tuning samples for generation, along with 1.5 million editing samples [21].

Image Generation and Editing
- UniLIP excels in both image generation and editing tasks, achieving high scores on benchmarks due to its rich feature representation and precise semantic alignment [26][27][30].
- The dual-condition architecture effectively connects the MLLM with diffusion models, ensuring high fidelity and consistency in generated and edited images [18][32].
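As a concretization of the two-stage recipe in the Methodology section, here is a minimal, hypothetical sketch of the two loss functions. The module names (clip_student, clip_teacher, decoder) are placeholders, and the use of plain MSE for both reconstruction and self-distillation is an assumption rather than UniLIP's exact objective.

```python
# Hypothetical sketch of UniLIP-style two-stage training (placeholder modules).
import torch
import torch.nn.functional as F

def stage1_loss(image, clip_frozen, decoder):
    # Stage 1: CLIP stays frozen; only the pixel decoder learns to
    # reconstruct images from the fixed CLIP features.
    with torch.no_grad():
        feats = clip_frozen(image)
    recon = decoder(feats)
    return F.mse_loss(recon, image)

def stage2_loss(image, clip_student, clip_teacher, decoder, alpha=1.0):
    # Stage 2: CLIP is unfrozen so pixel detail can be injected into its
    # features, while a self-distillation term keeps them close to a frozen
    # teacher copy so the original understanding ability is preserved.
    feats = clip_student(image)
    with torch.no_grad():
        teacher_feats = clip_teacher(image)
    recon_loss = F.mse_loss(decoder(feats), image)
    distill_loss = F.mse_loss(feats, teacher_feats)
    return recon_loss + alpha * distill_loss
```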
Another Blow to VAE! Tsinghua and Kuaishou Unveil the SVG Diffusion Model, with 6200% Higher Training Efficiency and 3500% Faster Generation
量子位· 2025-10-28 05:12
Core Viewpoint
- The article discusses the transition from Variational Autoencoders (VAE) to new models like SVG, developed by Tsinghua University and Kuaishou, highlighting significant improvements in training efficiency and generation speed, as well as addressing the limitations of VAE in semantic entanglement [1][4][10].

Group 1: VAE Limitations and New Approaches
- VAE is being abandoned due to its semantic entanglement issue, where adjusting one feature affects others, complicating the generation process [4][8].
- The SVG model achieves a 62-fold improvement in training efficiency and a 35-fold increase in generation speed compared to traditional methods [3][10].
- The RAE approach focuses solely on enhancing generation performance by reusing pre-trained encoders, while SVG aims for multi-task versatility by constructing a feature space that integrates semantics and details [11][12].

Group 2: SVG Model Details
- SVG utilizes the DINOv3 pre-trained model for semantic extraction, effectively distinguishing features of different categories like cats and dogs, thus resolving semantic entanglement [14].
- A lightweight residual encoder is added to capture high-frequency details that DINOv3 may overlook, ensuring a comprehensive feature representation [14] (a minimal sketch of this feature construction follows the summary).
- The distribution alignment mechanism is crucial for maintaining the integrity of semantic structures while integrating detail features, as evidenced by a significant increase in FID values when this mechanism is removed [15][16].

Group 3: Performance Metrics
- In experiments, SVG outperformed traditional VAE models in various metrics, achieving a FID score of 6.57 on the ImageNet dataset after 80 epochs, compared to 22.58 for the VAE-based SiT-XL [18].
- The model's efficiency is further demonstrated with a FID score dropping to 1.92 after 1400 epochs, nearing the performance of top-tier generative models [18].
- SVG's feature space is versatile, allowing for direct application in tasks like image classification and semantic segmentation without the need for fine-tuning, achieving an 81.8% Top-1 accuracy on ImageNet-1K [22].
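The feature construction in Group 2 can be sketched as follows: frozen DINOv3 features supply semantics, a small trainable residual encoder supplies high-frequency detail, and a normalization step stands in for the distribution-alignment mechanism. This is a hypothetical illustration with placeholder modules, not the released SVG code.

```python
# Hypothetical SVG-style feature encoder (placeholder modules and dimensions).
import torch
import torch.nn as nn

class SVGStyleEncoder(nn.Module):
    def __init__(self, dinov3, detail_encoder, dim):
        super().__init__()
        self.dinov3 = dinov3.eval()           # frozen semantic backbone
        for p in self.dinov3.parameters():
            p.requires_grad_(False)
        self.detail_encoder = detail_encoder  # lightweight, trainable residual branch
        self.align = nn.LayerNorm(dim)        # stand-in for distribution alignment

    def forward(self, image):
        with torch.no_grad():
            sem = self.dinov3(image)          # (B, N, D) semantic patch features
        detail = self.detail_encoder(image)   # (B, N, D) high-frequency detail features
        # Align detail statistics to the semantic space, then fuse the two branches.
        return sem + self.align(detail)
```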
New Work from Saining Xie: VAE Steps Down, RAE Takes Its Place
量子位· 2025-10-14 08:16
Core Viewpoint
- The era of Variational Autoencoders (VAE) is coming to an end, with Representation Autoencoders (RAE) set to take over in the field of diffusion models [1][3].

Summary by Sections

RAE Introduction
- RAE is a new type of autoencoder designed for training diffusion Transformers (DiT), utilizing pre-trained representation encoders (like DINO, SigLIP, MAE) paired with lightweight decoders, replacing the traditional VAE [3][9] (a minimal sketch of this pairing follows the summary).

Advantages of RAE
- RAE provides high-quality reconstruction results and a semantically rich latent space, supporting scalable transformer-based architectures. It achieves faster convergence without the need for additional representation alignment losses [4][10].

Performance Metrics
- At a resolution of 256×256, the FID score without guidance is 1.51, and with guidance it is 1.13 for both 256×256 and 512×512 resolutions [6].

Limitations of VAE
- VAE has outdated backbone networks, leading to overly complex architectures, requiring 450 GFLOPs compared to only 22 GFLOPs for a simple ViT-B encoder [7].
- The compressed latent space of VAE (only 4 channels) severely limits information capacity, resulting in minimal improvement in information carrying ability [7].
- VAE's weak representation capability, relying solely on reconstruction training, leads to low feature quality and slows down convergence, negatively impacting generation quality [7].

RAE's Design and Training
- RAE combines pre-trained representation encoders with trained decoders without requiring additional training or alignment phases, and it does not introduce auxiliary loss functions [9].
- RAE outperforms SD-VAE in reconstruction quality despite its simplicity [10].

Model Comparisons
- RAE models such as DINOv2-B, SigLIP2-B, and MAE-B show significant improvements in rFID and Top-1 accuracy compared to SD-VAE [11].

Adjustments for Diffusion Models
- RAE requires simple adjustments for effective performance in high-dimensional spaces, including a wide DiT design, noise scheduling, and noise injection in the decoder training [13][17].
- The DiT-XL model trained with RAE surpasses REPA without any auxiliary losses or additional training phases, achieving convergence speeds up to 16 times faster than REPA based on SD-VAE [18][19].

Scalability and Efficiency
- The new architecture enhances the scalability of DiT in terms of training computation and model size, outperforming both standard DiT based on RAE and traditional methods based on VAE [24].
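To illustrate the pairing described under "RAE Introduction" and the decoder-side noise injection mentioned under "Adjustments for Diffusion Models", here is a minimal, hypothetical sketch. The encoder and decoder are placeholders (e.g., a frozen DINOv2-B and a lightweight ViT decoder), and the simple MSE objective with Gaussian latent noise is an assumption, not the paper's exact recipe.

```python
# Hypothetical RAE-style autoencoder: frozen representation encoder plus a
# lightweight trainable decoder (placeholder modules throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepresentationAutoencoder(nn.Module):
    def __init__(self, rep_encoder, decoder):
        super().__init__()
        self.encoder = rep_encoder.eval()   # frozen, e.g. DINOv2-B / SigLIP2-B / MAE-B
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.decoder = decoder              # lightweight ViT decoder, trainable

    def encode(self, image):
        with torch.no_grad():
            return self.encoder(image)      # semantically rich latents, (B, N, D)

    def forward(self, image):
        return self.decoder(self.encode(image))

def decoder_training_step(rae, image, noise_std=0.1):
    # Reconstruction loss for the decoder; a small amount of noise is added to
    # the latents, mirroring the noise injection mentioned in the summary.
    z = rae.encode(image)
    z_noisy = z + noise_std * torch.randn_like(z)
    return F.mse_loss(rae.decoder(z_noisy), image)
```

A diffusion transformer would then be trained directly in this latent space, which is what the summary's "wide DiT design" and noise-scheduling adjustments refer to.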
ByteDance Open-Sources an All-Round "Hexagon Warrior" for Image Generation: One Model Handles Character, Subject, and Style Preservation
量子位· 2025-09-04 04:41
Core Viewpoint
- ByteDance's UXO team has developed and open-sourced a unified framework called USO, which addresses the multi-indicator consistency problem in image generation, enabling simultaneous style transfer and subject retention across various tasks [1][19].

Group 1: Model Capabilities
- USO can effectively manage subject, character, or style retention using a single model and just one reference image [7].
- The framework allows for diverse applications, such as generating cartoon characters in different scenarios, like driving a car or reading in a café, while maintaining high image quality comparable to commercial models [8][10][12][14].
- USO has been evaluated using a newly designed USO-Bench, which assesses performance across subject-driven, style-driven, and mixed generation tasks, outperforming several contemporary models [17][19].

Group 2: Performance Metrics
- In the performance comparison, USO achieved a subject-driven generation score of 0.623 and a style-driven generation score of 0.557, placing it at the top among various models [18].
- User studies indicated that USO received high ratings across all evaluation dimensions, particularly in subject consistency, style consistency, and image quality [19].

Group 3: Innovative Techniques
- USO employs a "cross-task self-decoupling" paradigm, enhancing the model's learning capabilities by allowing it to learn features relevant to different task types [21].
- The architecture is based on the open-source model FLUX.1 dev, incorporating style alignment training and content-style decoupling training [22].
- The introduction of a Style Reward Learning (SRL) algorithm, designed for Flow Matching, further promotes the decoupling of content and style through a mathematically mapped reward function [24][25].

Group 4: Data Framework
- The team has created a cross-task data synthesis framework, innovatively constructing triplet data that includes both layout-changing and layout-preserving elements [30].
Official Nano Banana Prompts Are Here, with Complete Code Examples
量子位· 2025-09-03 05:49
Core Viewpoint
- The article discusses the rising popularity of the Nano-banana tool, highlighting its innovative features and the official guidelines released by Google to help users effectively utilize this technology [1][8].

Group 1: Features of Nano-banana
- Nano-banana allows users to generate high-quality images from text descriptions, edit existing images with text prompts, and create new scenes using multiple images [15].
- The tool supports iterative refinement, enabling users to gradually adjust images until they achieve the desired outcome [15].
- It can accurately render text in images, making it suitable for logos, charts, and posters [15].

Group 2: Guidelines for Effective Use
- Google emphasizes the importance of providing detailed scene descriptions rather than just listing keywords to generate better and more coherent images [9][10].
- Users are encouraged to think like photographers by considering camera angles, lighting, and fine details to achieve realistic images [19][20].
- The article provides specific prompt structures for various types of images, including photorealistic shots, stylized illustrations, product photography, and comic panels [20][24][35][43].

Group 3: Examples and Applications
- The article showcases examples of images generated by Nano-banana, such as a cat dining in a luxurious restaurant under a starry sky, demonstrating the tool's capability to create detailed and imaginative scenes [14][17].
- It also includes code snippets for developers to integrate the image generation capabilities into their applications, highlighting the accessibility of the technology [21][29][35] (a hedged sketch of such a call follows the summary).
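Since the article mentions code snippets for integrating the image generation capability, here is a minimal, hypothetical sketch using the google-genai Python SDK. The model identifier "gemini-2.5-flash-image-preview" and the descriptive prompt (written in the guide's "think like a photographer" style) are assumptions; check the current model list before use.

```python
# Hypothetical sketch: generating an image with a Gemini image model via the
# google-genai SDK. The model name is an assumption taken from public previews.
from google import genai

client = genai.Client()  # reads the Gemini API key from the environment

prompt = (
    "A photorealistic scene of a cat dining in a luxurious restaurant under a "
    "starry sky, warm candlelight, shallow depth of field, shot on a 50mm lens."
)

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed identifier for Nano-banana
    contents=prompt,
)

# Image bytes come back as inline data parts; text parts may accompany them.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("cat_restaurant.png", "wb") as f:
            f.write(part.inline_data.data)
    elif part.text is not None:
        print(part.text)
```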