Autoregressive Image Generation

The missing piece for unified understanding and generation? Tencent releases X-Omni: reinforcement learning revives discrete autoregressive generation, effortlessly rendering long-text images
机器之心· 2025-08-10 04:31
Core Insights
- The article discusses advances in image generation technology, focusing on the X-Omni model developed by Tencent's team, which significantly enhances the quality of autoregressive image generation through reinforcement learning [2][4][5].

Group 1: Model Development
- The X-Omni model uses reinforcement learning to improve the aesthetic quality of generated images and their adherence to complex instructions, showing superior performance in rendering long texts [5][6].
- The model architecture is based on discrete tokens and employs a diffusion decoder to generate images, allowing a unified approach to visual understanding and generation [6][11].

Group 2: Reinforcement Learning Approach
- The reinforcement learning process incorporates a comprehensive reward model that evaluates image generation quality along multiple dimensions, including human aesthetic preference and text-image semantic alignment [9][12].
- The GRPO reinforcement learning method strengthens the model's image generation capabilities, demonstrating that RL optimization surpasses traditional supervised fine-tuning [8][19].

Group 3: Performance Evaluation
- X-Omni outperforms existing models across benchmarks, achieving high scores in both text rendering and instruction following: 0.901 (English) and 0.895 (Chinese) on text rendering [13][14].
- In instruction-following assessments, X-Omni achieved an overall score of 87.65, indicating its effectiveness at understanding and executing complex prompts [14].

Group 4: Unique Findings
- Unlike traditional autoregressive models that rely heavily on classifier-free guidance (CFG) to enhance generation quality, X-Omni produces high-quality images without CFG, demonstrating a high degree of integration between the visual and language generation mechanisms [17].
- The research highlights the unique advantages of reinforcement learning in image generation, providing more comprehensive and efficient optimization signals compared to conventional methods [19].
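The GRPO method mentioned above scores a group of sampled images per prompt with the reward model and normalizes each reward against its own group, rather than training a separate value network. A minimal sketch of that group-relative advantage computation, assuming the multi-dimensional rewards (aesthetics, text-image alignment, etc.) have already been combined into one scalar per sample; the function name and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward against its group.

    rewards: shape (num_prompts, group_size), one row per prompt,
    one column per sampled image scored by the reward model.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)  # per-group baseline
    std = rewards.std(axis=1, keepdims=True)    # per-group scale
    return (rewards - mean) / (std + eps)

# One prompt, four sampled images: above-average samples get
# positive advantage, below-average ones negative.
adv = group_relative_advantages([[0.2, 0.8, 0.5, 0.5]])
```

These advantages then weight the policy-gradient update on the autoregressive model's token log-probabilities, so the group average acts as the baseline.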
Visual tokens seamlessly aligned with the LLM vocabulary! V²Flow: high-fidelity autoregressive image generation built on LLMs
量子位· 2025-04-03 02:12
Visual tokens can now be seamlessly aligned with an LLM's vocabulary! V²Flow enables high-fidelity autoregressive image generation on top of LLMs.

The key to autoregressive image generation is designing a vector-quantization (VQ) visual tokenizer that discretizes visual content into discrete tokens analogous to a large language model's vocabulary. Despite progress, existing methods face two persistent obstacles:

1. Distribution mismatch: the discrete representations produced by conventional visual tokenizers deviate significantly from the distribution of the LLM vocabulary.
2. Curse of dimensionality: the two-dimensional structure of images forces the LLM to predict visual tokens row by row, which fundamentally conflicts with the coherent semantic prediction of one-dimensional text.

This dual split, structural and distributional, exposes a major weakness of current autoregressive visual generation: the lack of a visual tokenizer that both guarantees high-fidelity image reconstruction and is unified with pretrained LLM vocabularies in structure and feature distribution. Solving this is essential for effective multimodal autoregressive modeling and stronger instruction following.

The core question is therefore: can we design a visual tokenizer whose discrete visual tokens fuse seamlessly with a pretrained LLM's vocabulary while preserving high-quality visual reconstruction?

Unifying visual tokens with the LLM vocabulary, newly open-sourced ...
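The vector-quantization step at the heart of such a tokenizer maps each continuous encoder feature to its nearest entry in a learned codebook, turning an image into a sequence of discrete token ids. A minimal sketch under the simplifying assumption that the codebook is a plain embedding table (in a vocabulary-aligned design like V²Flow this table would be tied to the LLM's token embeddings; that tying is not shown here):

```python
import numpy as np

def quantize(features, codebook):
    """Assign each feature vector to its nearest codebook entry.

    features: (n, d) continuous visual features from an encoder.
    codebook: (k, d) embedding table playing the role of a vocabulary.
    Returns (token_ids, quantized_vectors).
    """
    # Squared Euclidean distance from every feature to every code,
    # via broadcasting: result has shape (n, k).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d2.argmin(axis=1)          # discrete token id per feature
    return ids, codebook[ids]        # id sequence + reconstruction input

# Toy 3-entry codebook in 2-D feature space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
ids, q = quantize(np.array([[0.1, -0.1], [0.9, 1.2]]), codebook)
# ids -> [0, 1]
```

The resulting id sequence is what the autoregressive LLM predicts; the quantized vectors feed the decoder that reconstructs the image.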