Text-to-Image

Let AI Art Correct Its Own Mistakes! Randomly Dropping Modules Boosts Generation Quality and Says Goodbye to Plastic-Looking Rejects
量子位· 2025-08-23 05:06
Meng Chen, from Aofeisi. QbitAI | WeChat official account QbitAI

AI image and video generation can now "rescue itself"?! While everyone is still tearing their hair out over CFG (classifier-free guidance) parameters, only to end up with a pile of "plastic-looking" rejects, a research team from Tsinghua University, Alibaba AMAP (Amap), and the Institute of Automation, Chinese Academy of Sciences has released a new method: S²-Guidance (Stochastic Self-Guidance).

Its core idea is to dynamically construct "weak" sub-networks via stochastic block-dropping, so that the generation process can correct itself (a toy sketch of the idea follows below). This not only teaches the AI to actively steer around failure modes; more importantly, it avoids the tedious, model-specific parameter tuning required by similar methods, making it genuinely plug-and-play with clear gains. On both text-to-image and text-to-video tasks, S²-Guidance markedly improves the quality and coherence of the generated results.

Specifically:

1. CFG's bottleneck: distorted results plus a lack of generality. In the world of diffusion models, CFG (Classifier-Free Guidance) is the standard recipe for improving generation quality and text alignment. But its "linear extrapolation" nature makes it prone to oversaturation and distortion at high guidance strengths. To address this, the academic community's earlier idea was to introduce a ...
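To make the mechanism concrete, here is a minimal, self-contained toy sketch of stochastic block-dropping guidance. The network, the drop probability, and the way the strong and weak predictions are combined are all illustrative assumptions based on the description above, not the paper's actual architecture or formula.

```python
import torch
import torch.nn as nn

class ToyDiffusionNet(nn.Module):
    """Stand-in for a diffusion backbone (U-Net/DiT): a stack of
    residual blocks. Purely illustrative; real models are far larger
    and also take a timestep and text conditioning."""
    def __init__(self, dim=64, n_blocks=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.SiLU()) for _ in range(n_blocks)
        )

    def forward(self, x, drop_prob=0.0):
        for block in self.blocks:
            # Stochastic block-dropping: randomly skip blocks to obtain
            # a "weak" sub-network of the same model.
            if drop_prob > 0 and torch.rand(()).item() < drop_prob:
                continue
            x = x + block(x)  # residual connection keeps skipping well-defined
        return x

def s2_guided_prediction(net, x_t, drop_prob=0.25, scale=1.0):
    """Push the full model's prediction away from the weak sub-network's
    prediction, steering generation off the weak model's failure modes.
    The combination rule here is an assumption, not the paper's formula."""
    eps_full = net(x_t)                       # strong prediction
    eps_weak = net(x_t, drop_prob=drop_prob)  # weak sub-network prediction
    return eps_full + scale * (eps_full - eps_weak)

x_t = torch.randn(2, 64)
print(s2_guided_prediction(ToyDiffusionNet(), x_t).shape)  # torch.Size([2, 64])
```

Because the weak model is sampled from the strong one on the fly, no second network has to be trained or tuned, which is what makes the approach plug-and-play.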
Qwen-Image Model Launches on CoresHub (基石智算): Experience Its Powerful Text Rendering
Sou Hu Cai Jing· 2025-08-14 15:48
Core Insights
- Qwen-Image, the first text-to-image foundation model in the Qwen series, has been launched on CoresHub, Qiyun Technology's AI computing cloud; it has 20 billion parameters and was developed by Alibaba's Tongyi Qianwen team [1]
- The model excels at complex text rendering, precise image editing, multi-line layout, paragraph-level generation, and detail depiction, making it particularly effective for poster design [1]

Model Highlights
- Exceptional text rendering: Qwen-Image performs strongly on complex text generation and rendering, supporting multi-line typesetting, paragraph-level layout, and fine-grained detail in both English and Chinese [2]
- Consistency in image editing: leveraging an enhanced multi-task training paradigm, Qwen-Image can accurately modify target areas while preserving overall visual consistency and semantic coherence [2]
- Industry-leading performance: multiple public benchmark results indicate that Qwen-Image achieves state-of-the-art (SOTA) results across a range of image generation and editing tasks, validating its overall strength [2]

Usage Steps
- Log into CoresHub, navigate to the model plaza, select the Qwen-Image model, and click model deployment [3]
- Deploy on a single-card 4090D resource type; after deployment succeeds, copy the external access link and open it in a browser [4]
- Once the ComfyUI page loads, select the Qwen-Image template and enter a prompt [6] (for a scripted alternative, see the sketch after this summary)

Effect Demonstration
- Various prompts showcase Qwen-Image's capabilities, including imaginative scenarios such as a Shiba Inu wearing a cowboy hat at a bar, a cotton candy castle in the clouds, and a retro arcade with a pixel-style game machine [9][11][12][13]
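For readers who prefer scripting over the ComfyUI page, here is a minimal sketch that assumes Qwen-Image is exposed through the standard Hugging Face diffusers pipeline interface under the Qwen/Qwen-Image repository id; check the model card for the officially supported loading code, precision, and VRAM requirements.

```python
import torch
from diffusers import DiffusionPipeline

# Assumed repo id and standard pipeline interface; verify against
# the model card before relying on this.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

# Text rendering is the model's headline strength, so the prompt
# asks for legible, laid-out text.
prompt = (
    "A poster with the title '基石智算 CoresHub' in bold calligraphy, "
    "multi-line layout, clean typography, warm lighting"
)
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("qwen_image_poster.png")
```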
The State of Models in China's Multimodal Large Model Industry in 2025: Image, Video, Audio, and 3D Models Will Ultimately Be Connected and Fused [Charts]
Qian Zhan Wang· 2025-06-01 05:09
Core Insights
- Exploration of multimodal large models is progressing step by step, with a focus on breakthroughs in visual modalities, aiming toward an "Any-to-Any" model that requires workable pathways across all modalities [1]
- The industry is currently concentrating on improving perception and generation models in the image, video, and 3D modalities, with the goal of cross-modal integration and sharing [1]

Multimodal Large Models in Image
- Before the rise of LLMs in 2023, the industry had already built a solid foundation in image understanding and generation, producing models such as CLIP, Stable Diffusion, and GANs, which led to applications like Midjourney and DALL·E [2]
- The industry is actively exploring how to bring Transformer models into image tasks, with notable outcomes including GLIP, SAM, and GPT-4V [2]

Multimodal Large Models in Video
- Video generation is approached by transferring image generation models to video, training on image data and aligning the temporal dimension to achieve text-to-video results [5]
- Recent advances include models such as VideoLDM and Sora, which demonstrate significant breakthroughs in video generation using the Diffusion Transformer architecture [5]

Multimodal Large Models in 3D
- 3D generation is being explored by extending 2D image generation methods, with key models such as 3D GAN, MeshDiffusion, and Instant3D emerging in the industry [8][9]
- 3D data representations include meshes, point clouds, and NeRF, with NeRF a critical technology for representing 3D data [9]

Multimodal Large Models in Audio
- Audio-related AI technologies have matured, with Transformer models recently applied to improve audio understanding and generation, exemplified by Whisper large-v3 and VALL-E [11]
- The evolution of speech technology falls into three stages, with a focus on strengthening generalization across multiple languages and tasks [11]
Tencent Pulls Out a Big Move: Hunyuan Image 2.0 Lets You "Draw While Speaking"; the Image Is Ready the Moment You Finish Describing It
量子位· 2025-05-16 03:39
Core Viewpoint
- Tencent has launched the Hunyuan Image 2.0 model, which enables real-time image generation with millisecond-level response times, letting users create images seamlessly while describing them by voice or sketching [1][6]

Group 1: Features of Hunyuan Image 2.0
- The model supports a real-time drawing board where users can sketch elements and add text descriptions for immediate image generation [3][29]
- It offers multiple input methods, including text prompts, voice input in both Chinese and English, and uploading reference images to guide generation [18][19]
- Users can refine generated images by adjusting parameters such as reference-image strength, and can use a feature that automatically enhances composition and depth [27][35]

Group 2: Technical Highlights
- Hunyuan Image 2.0 is significantly larger than its predecessor, Hunyuan DiT, increasing parameter count by an order of magnitude, which improves performance [37]
- The model uses a high-compression image codec developed by Tencent, which shortens the encoding sequence and speeds up image generation while maintaining quality [38] (the arithmetic behind this lever is sketched below)
- It uses a multimodal large language model (MLLM) as the text encoder, improving semantic understanding and prompt matching over traditional encoders [39][40]
- The model has undergone reinforcement learning training to make generated images more realistic and better aligned with real-world requirements [41]
- Tencent built an in-house adversarial distillation scheme that enables high-quality generation in fewer sampling steps [42]

Group 3: Future Developments
- Tencent's team has indicated that more details will be shared in upcoming technical reports, including information about a native multimodal image generation model [43][45]
- The new model is expected to excel at multi-round image generation and real-time interactive experiences [46]
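The codec point is easy to quantify. A DiT-style generator's cost grows with the number of latent tokens per image, and a higher-compression codec shrinks that count quadratically. The factors below are illustrative defaults, not Hunyuan's actual configuration:

```python
def latent_tokens(height: int, width: int, downsample: int = 8, patch: int = 2) -> int:
    """Tokens a DiT must process for one image: pixels are first
    downsampled by the VAE/codec, then grouped into patches.
    All factors here are illustrative, not Tencent's real numbers."""
    return (height // downsample // patch) * (width // downsample // patch)

# Doubling the codec's downsampling factor cuts the sequence 4x,
# which directly shortens every attention layer's work.
print(latent_tokens(1024, 1024, downsample=8))   # 4096 tokens
print(latent_tokens(1024, 1024, downsample=16))  # 1024 tokens
```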
A Dark Horse Among Open-Source Text-to-Image Models, from Hefei
AI研究所· 2025-05-09 17:44
Core Viewpoint
- The article highlights HiDream.ai, a Chinese company whose image generation model, HiDream-I1, rivals OpenAI's GPT-4o in performance and capability, marking a significant advance in the AI field [1][3][6]

Group 1: HiDream-I1 Model Performance
- HiDream-I1 scored 1123 on the ArtificialAnalysis leaderboard, ranking second globally just behind GPT-4o at 1139, a mere 0.8% performance gap [3][6]
- In various benchmark tests, HiDream-I1 outperformed models such as Midjourney V6 and DALL·E 3, showing superior complex-prompt understanding and image quality [6]
- HiDream-I1 is described as the only open-source image generation model licensed for commercial use, which has drawn significant attention from developers and companies worldwide [6][10]

Group 2: Team Background and Business Strategy
- HiDream.ai was founded in March 2023; its team comes largely from the University of Science and Technology of China and is led by founder Mei Tao, who has a strong AI research background [8][9]
- The company is exploring sustainable business models while focusing on user pain points to build optimized products and services [10]
- HiDream.ai has already deployed its technology in several applications, including a strategic partnership with Cambricon for cloud acceleration and a collaboration with China Mobile on AI video products [11]

Group 3: Local Ecosystem and Industry Growth
- HiDream.ai's success is closely tied to Hefei's supportive ecosystem, which integrates resources from universities, government, and enterprises to foster rapid AI industry growth [14][15]
- Hefei aims for an AI industry scale exceeding 200 billion yuan by 2025, backed by significant investment in computing power and infrastructure [16][21]
- The city has established itself as a national AI industrial base, with over 2,200 companies and revenue exceeding 200 billion yuan in 2023, showcasing its competitive edge in the AI sector [16][21]
Text-to-Image Enters Its R1 Moment: CUHK MMLab Releases T2I-R1
机器之心· 2025-05-09 02:47
Core Viewpoint
- The article discusses T2I-R1, a novel text-to-image generation model that combines a dual-level Chain of Thought (CoT) reasoning framework with reinforcement learning to improve image quality and alignment with human expectations [1][3][11]

Group 1: Methodology
- T2I-R1 employs two distinct levels of CoT reasoning: Semantic-CoT, which focuses on the global structure of the image, and Token-CoT, which handles the detailed generation of image tokens [6][7]
- Semantic-CoT plans and reasons about the image before generation, optimizing alignment between prompts and generated images [7][8]
- Token-CoT generates image tokens sequentially, ensuring visual coherence and detail in the generated images [7][8]

Group 2: Model Enhancement
- T2I-R1 enhances a unified language and vision model (ULM) by incorporating both Semantic-CoT and Token-CoT into a single text-to-image framework [9][11]
- Reinforcement learning jointly optimizes the two CoT levels, generating multiple sets of Semantic-CoT and Token-CoT for a single image prompt [11][12] (see the sketch after this summary)

Group 3: Experimental Results
- T2I-R1 shows improved robustness and closer alignment with human expectations when generating images from prompts, particularly in unusual scenarios [13]
- Quantitatively, T2I-R1 outperforms baseline models by 13% and 19% on the T2I-CompBench and WISE benchmarks respectively, surpassing previous state-of-the-art models [16]
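The two-level scheme can be summarized in a few lines of schematic Python. Everything about the `ulm` interface (`generate_text`, `generate_image_tokens`, `log_prob`) is a hypothetical stand-in for the unified model, and the group-relative advantage is one reading of "jointly optimize ... multiple sets per prompt", not necessarily the paper's exact objective:

```python
import torch

def sample_dual_cot(ulm, prompt, n_samples=4):
    """Sample several (Semantic-CoT, Token-CoT) pairs for one prompt:
    first a textual plan of the image, then image tokens conditioned
    on prompt + plan. The ulm interface is hypothetical."""
    samples = []
    for _ in range(n_samples):
        plan = ulm.generate_text(f"Plan the image for: {prompt}")  # Semantic-CoT
        tokens = ulm.generate_image_tokens(prompt, plan)           # Token-CoT
        samples.append((plan, tokens))
    return samples

def rl_update(ulm, prompt, reward_fn, optimizer, n_samples=4):
    """Jointly reinforce both CoT levels: score each sampled pair,
    baseline by the group mean, and push up the log-probability of
    above-average samples (a GRPO-style rule, assumed here)."""
    samples = sample_dual_cot(ulm, prompt, n_samples)
    rewards = torch.tensor([reward_fn(plan, toks) for plan, toks in samples])
    advantages = rewards - rewards.mean()  # group-relative advantage
    loss = torch.zeros(())
    for (plan, toks), adv in zip(samples, advantages):
        # log_prob spans both the plan and the image tokens, so one
        # update trains Semantic-CoT and Token-CoT together.
        loss = loss - adv * ulm.log_prob(prompt, plan, toks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```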
I've Pretty Much Figured Out AI-Generated Typography: This Prompt Set Boosts Efficiency by 50%
数字生命卡兹克· 2025-04-13 17:16
A very cool way of generating typography with Jimeng 3.0, worked out by A-Zhen and reposted here for everyone.

Hi everyone, happy Monday! Ideas flash through my head every day and vanish just as fast; if I don't write them down quickly I get too lazy to ever try them, so they really should be recorded. Today's is another solid piece of practical material: with this set of prompts, you only need to type in your own text content to get a decent typographic design. To test and present the results I nearly emptied my Jimeng AI quota, and after trying a great many combinations and styles I'm confident the results really are good. I'll keep it short today: enjoy the images, then grab the prompt template and try it yourself (a reusable skeleton is sketched below).

First, some of the better-looking results:

"艺术家看到的比你多在哪" / "WHERE DO ARTISTS SEE BEYOND YOU": abstract-concept calligraphy fused with a negative-space deconstruction style; the text's edges dissolve slightly, like the fringe of consciousness; a floating arrangement evokes fragments of thought; the background is an ethereal grey-white with interwoven real and unreal textures, like fissures in a mental space; the typeface uses semi-transparent layered brush strokes, light and fragmentary, forming surreal visual negative space; a detached, calm temperament with a philosophical, meditative atmosphere and a strong sense of leaping thought; minimalist philosophical composition, a stream-of-consciousness art masterpiece.

"电竞少年" / "E-SPORTS YOUTH": e-sports energy fused with dynamic elegance and a sci-fi light-cut style; sharp, crisp letterforms with lines extending like electric current; bright outlines combined with speed-motion effects; the background is a deep ...
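The example prompts above share a repeating skeleton: quoted bilingual text, a theme fused with a style, letterform treatment, background, stroke treatment, mood, and composition. Below is that skeleton as a reusable Python template; the slot names are my own labeling of the pattern, not the original author's wording, and the background value is invented because the original prompt is truncated at that point:

```python
# Hypothetical reconstruction of the prompt pattern, for reuse.
FONT_DESIGN_TEMPLATE = (
    '"{cn_text}" / "{en_text}", {theme} fused with {style} style, '
    "{letterforms}, background: {background}, "
    "strokes: {strokes}, mood: {mood}, {composition}"
)

prompt = FONT_DESIGN_TEMPLATE.format(
    cn_text="电竞少年",
    en_text="E-SPORTS YOUTH",
    theme="e-sports energy",
    style="dynamic elegance and sci-fi light-cut",
    letterforms="sharp, crisp letterforms with lines extending like electric current",
    background="a deep tech gradient",  # invented: the source prompt is cut off here
    strokes="bright outlines combined with speed-motion effects",
    mood="high-energy, futuristic temperament",
    composition="poster-style centered composition",
)
print(prompt)
```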
Highway Accident Fallout Grows as Lei Jun Responds for the First Time; OpenAI Valued at $300 Billion in a Masayoshi Son-Backed Round; Gold Prices Hit Consecutive Highs, as Do Laopu Gold's Revenue and Profit | Ten-Billion-Dollar Company Moves
晚点LatePost· 2025-04-01 15:36
Lei Jun and Xiaomi EV respond to the Xiaomi SU7 highway accident.

On April 1, a Xiaomi company spokesperson said on Weibo that at 22:44 on March 29, 2025, a Xiaomi SU7 Standard Edition suffered a severe traffic accident on the Chiqi section of the Deshang Expressway, killing three people. According to preliminary findings, before the accident the vehicle was in NOA intelligent assisted-driving mode, traveling at a sustained 116 km/h.

According to Xiaomi EV's announcement, the section of road was closed for construction, with barriers diverting traffic from the original lane into the opposite lane. The vehicle detected the obstacle, issued a warning, and began to decelerate. About one second later, the driver took over and the car entered manual driving mode, with NOA disengaging. The driver continued braking and steered the vehicle, which then collided with a cement pillar of the median barrier; the last speed the system could confirm before impact was about 97 km/h.

On the evening of April 1, Xiaomi EV issued a statement saying that, based on what is currently known, it can only confirm that the fire was not spontaneous combustion; it is presumed to have been caused by severe damage to the vehicle's systems after the violent collision with the cement pillar. It added that, having not yet had access to the vehicle, it cannot further analyze the cause of the fire or whether the doors could be opened at the time of the accident. Lei Jun also responded publicly for the first time, saying: "On behalf of Xiaomi, I promise: no matter what happens, Xiaomi will not evade responsibility. We will continue to cooperate with the police investigation, follow up on the handling of the matter, and do our utmost to address the concerns of the families and the public."

OpenAI ...
OpenAI Replicates Ghibli: Are Large Models Devouring Every Product?
创业邦· 2025-03-28 10:32
Core Viewpoint
- OpenAI's release of the GPT-4o model significantly improves text-to-image generation, surpassing competitors in areas including detail accuracy and user control [4][7][10]

Group 1: Product Features and Innovations
- GPT-4o lets paid users generate and modify images directly inside ChatGPT, removing the need for a separate model such as DALL-E [4]
- Its ability to generate images with high fidelity and consistent detail is a clear improvement over earlier models, which often struggled with legible text and realism [7][10]
- GPT-4o offers a more intuitive experience: users can give simple conversational prompts instead of long, precisely engineered instructions [10][20]

Group 2: Technical Advancements
- GPT-4o is built on an omni-modal approach, enabling it to generate multiple data types, including text, images, audio, and video [13][14]
- The model generates images autoregressively, token by token, in contrast to the diffusion approach used by many competitors, which refines a whole canvas over repeated denoising steps [13][14] (both approaches are sketched below)
- OpenAI has significantly improved text-image alignment, interpreting user prompts more accurately than traditional models [14][16]

Group 3: Market Impact and Competitive Landscape
- GPT-4o's advances threaten existing startups in the text-to-image space, as its capabilities render many previously developed tools obsolete [10][21]
- The rise of "Vibe Coding" reflects a shift in programming and creative workflows, where users generate code or images from minimal input by relying on the model's capabilities [19][20]
- Competition in AI may increasingly favor large companies with the resources to develop and train large models, potentially sidelining smaller startups focused on niche optimizations [22][23]
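The autoregressive-versus-diffusion contrast is easiest to see side by side. Both functions below are schematic: `model` and `denoiser` are hypothetical stand-ins, and real systems add tokenizers, schedulers, and guidance on top.

```python
import torch

def autoregressive_image(model, prompt_tokens, n_image_tokens=1024):
    """Autoregressive generation (the approach attributed to GPT-4o
    above): image tokens are sampled one at a time, each conditioned
    on everything generated so far. `model` is any next-token
    predictor over a joint text+image vocabulary (hypothetical)."""
    seq = list(prompt_tokens)
    for _ in range(n_image_tokens):
        logits = model(torch.tensor([seq]))[0, -1]  # logits for the next token
        probs = torch.softmax(logits, dim=-1)
        seq.append(int(torch.multinomial(probs, 1)))
    return seq[len(prompt_tokens):]  # the image tokens only

def diffusion_image(denoiser, cond, steps=50):
    """The diffusion alternative used by many competitors: the whole
    latent canvas is refined in parallel over many denoising steps,
    rather than emitted token by token."""
    x = torch.randn(1, 4, 64, 64)  # start from pure noise
    for t in reversed(range(steps)):
        x = denoiser(x, t, cond)   # refine the entire image at once
    return x
```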
OpenAI Replicates Ghibli: Are Large Models Devouring Every Product?
晚点LatePost· 2025-03-27 14:45
The header image was generated by GPT-4o; the prompt was "Please generate a Ghibli-style image based on the following sentence: a circle of people watching a machine spit out images."

By He Qianming. Edited by Huang Junjie.

Two days after the new feature launched, under a tweet by OpenAI founder Sam Altman, someone congratulated him that ten years of effort had finally delivered AGI: "All Ghibli Images", as social networks filled with Ghibli-style pictures.

On March 26, OpenAI updated GPT-4o's text-to-image feature. Paid users can now call 4o directly inside ChatGPT to generate and modify images, without needing OpenAI's separate text-to-image model DALL-E. Within a single day, the most influential photos and memes of recent years had all been redone with 4o, and the most popular style was Hayao Miyazaki's.

That everyone generates Ghibli-style images is not only down to Miyazaki's extraordinary contribution to the world, but also to OpenAI's nudging: in the livestream announcing the new GPT-4o features, Altman chose to generate a Ghibli-style selfie of three people. In fact, GPT-4o usually performs well in other styles too. Text-to-image is no longer new, and earlier products could already produce stylized results. For example, Midjourney's annual subscribers can restyle photos, and Stable Diffusion ...