Image Generation

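ByteDance open-sources an all-rounder ("hexagon warrior") for image generation: one model handles character, subject, and style preservation
QbitAI (量子位) · 2025-09-04 04:41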
Core Viewpoint
- ByteDance's UXO team has developed and open-sourced USO, a unified framework that addresses the multi-indicator consistency problem in image generation, enabling simultaneous style transfer and subject retention across various tasks [1][19].

Group 1: Model Capabilities
- USO handles subject, character, or style retention with a single model and just one reference image [7].
- The framework supports diverse applications, such as placing a cartoon character in different scenarios (driving a car, reading in a café) while maintaining image quality comparable to commercial models [8][10][12][14].
- USO was evaluated on the newly designed USO-Bench, which assesses subject-driven, style-driven, and mixed generation tasks, and it outperformed several contemporary models [17][19].

Group 2: Performance Metrics
- In the performance comparison, USO scored 0.623 on subject-driven generation and 0.557 on style-driven generation, placing it at the top among the models compared [18].
- User studies indicated that USO received high ratings across all evaluation dimensions, particularly subject consistency, style consistency, and image quality [19].

Group 3: Innovative Techniques
- USO employs a "cross-task self-decoupling" paradigm, which strengthens learning by letting the model learn the features relevant to each task type [21].
- The architecture is based on the open-source model FLUX.1 dev and adds style alignment training and content-style decoupling training [22].
- A Style Reward Learning (SRL) algorithm, designed for Flow Matching, further promotes the decoupling of content and style through a mathematically mapped reward function [24][25].

Group 4: Data Framework
- The team built a cross-task data synthesis framework that constructs triplet data containing both layout-changing and layout-preserving elements [30].

Official Nano Banana prompting guide arrives, with complete code examples
QbitAI (量子位) · 2025-09-03 05:49
Core Viewpoint
- The article discusses the rising popularity of the Nano-banana tool, highlighting its features and the official guidelines Google has released to help users apply it effectively [1][8].

Group 1: Features of Nano-banana
- Nano-banana lets users generate high-quality images from text descriptions, edit existing images with text prompts, and compose new scenes from multiple images [15].
- The tool supports iterative refinement, so users can adjust an image step by step until they reach the desired outcome [15].
- It renders text in images accurately, making it suitable for logos, charts, and posters [15].

Group 2: Guidelines for Effective Use
- Google emphasizes describing the scene in detail rather than just listing keywords, which produces better and more coherent images [9][10].
- Users are encouraged to think like photographers and specify camera angles, lighting, and fine details to achieve realistic images [19][20].
- The article provides specific prompt structures for various image types, including photorealistic shots, stylized illustrations, product photography, and comic panels [20][24][35][43].

Group 3: Examples and Applications
- The article showcases images generated by Nano-banana, such as a cat dining in a luxurious restaurant under a starry sky, demonstrating the tool's ability to create detailed and imaginative scenes [14][17].
- It also includes code snippets that developers can use to integrate image generation into their applications, highlighting the accessibility of the technology [21][29][35].
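The prompting advice summarized above (describe the scene instead of listing keywords, and think like a photographer) can be illustrated with a small prompt-assembly helper. This is only a sketch under stated assumptions: `SceneSpec` and `build_prompt` are hypothetical names that belong to no Google API, and the sentence template is an illustrative guess, not the official prompt structure from the guide.

```python
from dataclasses import dataclass


@dataclass
class SceneSpec:
    """Hypothetical container for the details a photographer would consider."""
    subject: str
    setting: str
    camera: str    # shot type or angle, e.g. "eye-level medium shot"
    lighting: str
    details: str   # fine details worth calling out


def build_prompt(spec: SceneSpec) -> str:
    """Assemble one descriptive paragraph instead of a bare keyword list."""
    return (
        f"A photorealistic {spec.camera} of {spec.subject}, set in {spec.setting}. "
        f"The scene is lit by {spec.lighting}. "
        f"Fine details: {spec.details}."
    )


spec = SceneSpec(
    subject="a cat dining at a table",
    setting="a luxurious restaurant under a starry sky",
    camera="eye-level medium shot",
    lighting="warm candlelight",
    details="a folded napkin and polished silverware",
)
prompt = build_prompt(spec)
print(prompt)
```

The resulting string would then be passed to whichever image generation client is in use; that call is deliberately left out because its exact signature is not given in the summary.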

Optical AI image generator cuts energy consumption to the millijoule level
Ke Ji Ri Bao · 2025-08-29 00:32
Core Insights
- A research team at the University of California, Los Angeles, has developed a new type of image generator that uses light beams instead of traditional computing hardware, cutting energy consumption to one hundred-thousandth of that of standard AI tools, so that only a few millijoules are required [1][2].

Group 1: Technology Overview
- Traditional digital diffusion models need hundreds to thousands of iterations to generate an image, whereas the new system needs only an initial encoding and no additional computation [2].
- The system uses a digital encoder, trained on publicly available image datasets, to create static encodings that can be converted into images [2].
- The encoding is physically imprinted onto a laser beam by a Spatial Light Modulator (SLM), and the image appears instantly when the laser passes through a second SLM [2].

Group 2: Performance and Applications
- In tests, the new system generated simple images and Van Gogh-style paintings with results comparable to those of traditional image generators [2].
- Generating a Van Gogh-style image consumed approximately a few millijoules, while traditional diffusion models required hundreds to thousands of joules [2].
- The system's low power draw makes it particularly suitable for wearable devices such as AI glasses [2].

Tencent applies for image generation patent enabling step-by-step guidance and robust control of image generation
Jin Rong Jie · 2025-08-16 09:19
Core Insights
- Tencent Technology (Shenzhen) Co., Ltd. has applied for a patent titled "Image Generation Method, Device, Equipment, Medium, and Product" (publication number CN120495475A), filed in May 2025 [1].
- The patent describes a method for generating images based on object input text, which includes processes for denoising random noise images and enhancing text prompts to create target images [1].

Company Overview
- Tencent Technology (Shenzhen) Co., Ltd. was established in 2000, is located in Shenzhen, and is primarily engaged in software and information technology services [1].
- The company has a registered capital of 2 million USD, has invested in 15 enterprises, has participated in 264 bidding projects, and holds 5000 trademark and patent records along with 534 administrative licenses [1].

Lumina-mGPT 2.0: autoregressive models make a spectacular comeback, rivaling top diffusion models
Jiqizhixin (机器之心) · 2025-08-12 00:15
Core Viewpoint
- Lumina-mGPT 2.0 is a stand-alone autoregressive image model that integrates tasks such as text-to-image generation, subject-driven generation, and controllable generation, marking a significant advance in image generation technology [5][9][21].

Group 1: Core Technology and Breakthroughs
- Lumina-mGPT 2.0 is trained fully independently as a decoder-only Transformer, available in two sizes (2 billion and 7 billion parameters), which avoids biases inherited from pre-trained models [4][5].
- The model uses a high-quality image tokenizer, SBER-MoVQGAN, selected for having the best reconstruction quality on the MS-COCO dataset [7].
- A unified multi-task processing framework gives seamless support for tasks including text-to-image generation and image editing [9].

Group 2: Efficient Inference Strategies
- The model introduces two optimizations that raise generation speed while maintaining quality, including model quantization to 4-bit integers and a sampling method that reduces GPU memory consumption by 60% [11][13].
- The optimizations allow for parallel decoding, which significantly accelerates generation [13].

Group 3: Experimental Results
- In text-to-image benchmarks, Lumina-mGPT 2.0 achieved a GenEval score of 0.80, ranking it among the top generative models, and it excelled in the "two objects" and "color attributes" tests [14][15].
- The model also performed strongly on the Graph200K multi-task benchmark, confirming that a pure autoregressive model is feasible for multi-modal generation tasks [17].

Group 4: Future Directions
- Despite the optimizations, sampling time remains a challenge for Lumina-mGPT 2.0 and affects user experience, so further improvement is needed [21].
- The focus will expand from multi-modal generation to include multi-modal understanding, aiming to improve overall functionality and performance [21].
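To make the "quantization to 4-bit integers" item concrete, here is a minimal sketch of what mapping floating-point weights to 4-bit integers can look like. The summary does not say which scheme Lumina-mGPT 2.0 uses, so the symmetric per-tensor scheme below, and the names `quantize_int4` and `dequantize_int4`, are assumptions chosen purely for illustration.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit integers in [-8, 7].

    Illustrative scheme only; not taken from the Lumina-mGPT 2.0 code base.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the 4-bit signed range.
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int4(q, scale):
    """Map the integer levels back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.70, -0.33, 0.10, 0.0, -0.70]
q, scale = quantize_int4(weights)
restored = dequantize_int4(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in 4 bits instead of 16 or 32, at the cost of a rounding error bounded by half the scale; production systems typically quantize per channel or per group rather than per tensor to keep that error small.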

Qwen's new open-source release blows past the SOTA for text rendering in AI image generation
QbitAI (量子位) · 2025-08-05 01:40
Core Viewpoint
- The article discusses the release of Qwen-Image, a 20-billion-parameter image generation model that excels at complex text rendering and image editing [3][28].

Group 1: Model Features
- Qwen-Image is the first foundational image generation model in the Tongyi Qianwen series and uses the MMDiT architecture [4][3].
- It performs exceptionally well at complex text rendering, supporting multi-line layouts and fine-grained detail in both English and Chinese [28][32].
- The model also offers consistent image editing, including style transfer, modifications, detail enhancement, text editing, and pose adjustments [27][28].

Group 2: Performance Evaluation
- Qwen-Image achieved state-of-the-art (SOTA) performance on several public benchmarks, including GenEval, DPG, and OneIG-Bench for image generation, and GEdit, ImgEdit, and GSO for image editing [29][30].
- In particular, it is significantly better at Chinese text rendering than existing advanced models [33].

Group 3: Training Strategy
- The model is trained progressively, moving from non-text to text rendering and from simple to complex text inputs, which strengthens its native text rendering capability [34].

Group 4: Practical Applications
- The article includes practical demonstrations of Qwen-Image, such as generating illustrations, PPTs, and promotional images, showing that it integrates text with visuals accurately [11][21][24].

Open source! Tongyi Qianwen launches Qwen-Image, the first image generation foundation model in the series
Hua Er Jie Jian Wen · 2025-08-04 21:09
Core Insights
- The article discusses the launch of Qwen-Image, a 20-billion-parameter MMDiT model and the first foundational image generation model in the Tongyi Qianwen series, which achieves significant advances in complex text rendering and precise image editing [1].

Group 1
- Qwen-Image is a foundational model designed specifically for image generation [1].
- The model has made notable progress in rendering complex text and editing images accurately [1].
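
Training time halved, performance up rather than down! Tencent Hunyuan open-sources MixGRPO, an efficient reinforcement scheme for image generation
QbitAI (量子位) · 2025-08-02 08:33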
Core Viewpoint
- The article introduces MixGRPO, a new framework that combines sampling based on stochastic differential equations (SDE) and ordinary differential equations (ODE) to improve the efficiency and performance of image generation training [1][81].

Group 1: MixGRPO Framework
- MixGRPO simplifies optimization in the Markov Decision Process (MDP) by using a mixed sampling strategy, which improves both efficiency and performance [1][17].
- The framework improves human preference alignment across multiple dimensions and outperforms DanceGRPO while cutting training time by nearly 50% [2][60].
- MixGRPO-Flash, a faster variant of MixGRPO, further reduces training time by 71% while maintaining similar performance [2][60].

Group 2: Performance Metrics
- In comparative studies, MixGRPO achieved a Unified Reward score of 3.418 versus DanceGRPO's 3.397, indicating better alignment with human preferences [60].
- MixGRPO-Flash had an average iteration time of 112.372 seconds, significantly lower than DanceGRPO's 291.284 seconds [60].

Group 3: Sampling Strategy
- MixGRPO uses a hybrid sampling method: SDE sampling is applied within a defined interval of the denoising process, and ODE sampling is applied outside that interval [14][20].
- This approach reduces computational overhead and optimization difficulty while keeping the sampling process aligned with the marginal distributions of SDE and ODE [30][81].

Group 4: Sliding Window Strategy
- A sliding window strategy is introduced to optimize the denoising steps, letting the model focus on specific time steps during training [32][35].
- The research team identified key hyperparameters for the sliding window, including window size and movement interval, which significantly affect performance [34][70].

Group 5: High-Order ODE Solvers
- Integrating high-order ODE solvers, such as DPM-Solver++, speeds up sampling during GRPO training and balances computational cost against performance [45][76].
- The experiments indicated that a second-order midpoint method was the optimal high-order solver setting [76].

Group 6: Experimental Validation
- The experiments used the HPDv2 dataset, which includes diverse prompts, and showed that MixGRPO achieves effective human preference alignment with a limited number of training prompts [49][50].
- Results from various reward models confirmed the robustness of MixGRPO, which performed better in both single-reward and multi-reward settings [56][82].
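The hybrid sampling and sliding-window ideas described above can be sketched as a schedule that labels each denoising step as SDE (inside the window) or ODE (outside), with the window shifting as training progresses. This is a minimal sketch under stated assumptions: the function names, the parameters `shift_interval` and `stride`, and the simple left-to-right shifting rule are all hypothetical, and the actual solvers and GRPO update are omitted.

```python
def window_start(train_iter, shift_interval, stride, num_steps, window_size):
    """Left edge of the SDE window (hypothetical shifting rule).

    The window advances `stride` denoising steps every `shift_interval`
    training iterations and is clamped so it never runs past the last step.
    """
    start = (train_iter // shift_interval) * stride
    return min(start, num_steps - window_size)


def sampler_schedule(num_steps, start, window_size):
    """Label each denoising step: SDE inside [start, start + window_size), ODE elsewhere."""
    return [
        "SDE" if start <= t < start + window_size else "ODE"
        for t in range(num_steps)
    ]


num_steps, window_size = 10, 3
for train_iter in (0, 25, 50):
    start = window_start(train_iter, shift_interval=25, stride=1,
                         num_steps=num_steps, window_size=window_size)
    print(train_iter, sampler_schedule(num_steps, start, window_size))
```

In a scheme like this, only the steps labeled SDE would need stochastic sampling during training, while the ODE steps are deterministic and could be handed to a faster high-order solver.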

Manus unexpectedly launches text-to-image! Goodbye to "gacha-style" rerolls: Agent + deep thinking create together
QbitAI (量子位) · 2025-05-16 05:36
Core Viewpoint
- Manus has announced a new image generation feature that differs from typical AI drawing tools by understanding the user's intent and planning the generation process before executing it [1][18].

Group 1: Image Generation Capabilities
- Manus can analyze a room's style from elements such as flooring and walls and write an analysis report before generating visual designs [5][4].
- The tool can search for furniture on websites like IKEA, select suitable items, and provide links along with the visual results [7][3].
- Manus has designed a bottle for a tea drink called "TeaVive," analyzing popular visual elements to appeal to the youth market [11].

Group 2: User Experience and Feedback
- Users have praised the integration of intelligent workflows with image generation as a great idea [6].
- Some users have raised concerns about pricing, with one noting that a monthly subscription of $39 allows only limited usage [26][28].
- Registration for Manus has been simplified and now offers 1000 points upon sign-up plus daily bonuses [22].

Group 3: Competitive Landscape
- The emergence of a competing platform, Lovart, which also focuses on design, has prompted Manus to enhance its offerings [18][20].
- Lovart has gained popularity quickly, much as Manus did at its initial launch, indicating a competitive environment in design AI [19].

Manus launches image generation feature
news flash · 2025-05-16 05:21
Core Insights
- Manus has launched an image generation feature that not only creates images but also understands user intent and plans solutions [1].

Company Overview
- Manus is positioned to use image generation and other tools effectively to accomplish user tasks [1].