Autoregressive Image Generation

The missing piece for unified understanding and generation? Tencent releases X-Omni: reinforcement learning revives discrete autoregressive generation, effortlessly rendering long-text images
机器之心· 2025-08-10 04:31
Core Insights
- The article discusses advances in image generation technology, focusing on the X-Omni model developed by Tencent's team, which significantly enhances the quality of autoregressive image generation through reinforcement learning [2][4][5].

Group 1: Model Development
- The X-Omni model uses reinforcement learning to improve the aesthetic quality of generated images and their adherence to complex instructions, showing superior performance in rendering long texts [5][6].
- The model architecture is based on discrete tokens and employs a diffusion decoder to generate images, allowing a unified approach to visual understanding and generation [6][11].

Group 2: Reinforcement Learning Approach
- The reinforcement learning process incorporates a comprehensive reward model that evaluates image generation quality along multiple dimensions, including human aesthetic preference and text-image semantic alignment [9][12].
- The GRPO reinforcement learning method strengthens the model's image generation capabilities, demonstrating that RL optimization surpasses traditional supervised fine-tuning [8][19].

Group 3: Performance Evaluation
- X-Omni outperforms existing models across benchmarks, achieving high scores in both text rendering and instruction following: 0.901 (English) and 0.895 (Chinese) on text rendering [13][14].
- In instruction-following assessments, X-Omni achieved an overall score of 87.65, indicating its effectiveness at understanding and executing complex prompts [14].

Group 4: Unique Findings
- Unlike traditional autoregressive models that rely heavily on classifier-free guidance (CFG) to enhance generation quality, X-Omni produces high-quality images without CFG, demonstrating a high degree of integration between the visual and language generation mechanisms [17].
- The research highlights the unique advantages of reinforcement learning in image generation, providing more comprehensive and efficient optimization signals compared to conventional methods [19].
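The GRPO method mentioned above scores a group of sampled images per prompt with the reward model and normalizes each reward against its own group, rather than training a separate value network. A minimal sketch of that group-relative advantage computation, assuming the multi-dimensional rewards (aesthetics, text-image alignment, etc.) have already been combined into one scalar per sample; the function name and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward against its group.

    rewards: shape (num_prompts, group_size), one row per prompt,
    one column per sampled image scored by the reward model.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)  # per-group baseline
    std = rewards.std(axis=1, keepdims=True)    # per-group scale
    return (rewards - mean) / (std + eps)

# One prompt, four sampled images: above-average samples get
# positive advantage, below-average ones negative.
adv = group_relative_advantages([[0.2, 0.8, 0.5, 0.5]])
```

These advantages then weight the policy-gradient update on the autoregressive model's token log-probabilities, so the group average acts as the baseline.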
Visual tokens seamlessly aligned with the LLM vocabulary! V²Flow: high-fidelity autoregressive image generation built on LLMs
量子位· 2025-04-03 02:12
Visual tokens can now be seamlessly aligned with an LLM's vocabulary! V²Flow enables high-fidelity autoregressive image generation on top of LLMs.

The key to autoregressive image generation is designing a vector-quantization (VQ) visual tokenizer that discretizes visual content into discrete tokens analogous to a large language model's vocabulary. Despite progress, existing methods face two persistent obstacles:

1. Distribution mismatch: the discrete representations produced by conventional visual tokenizers deviate significantly from the distribution of the LLM vocabulary.
2. Curse of dimensionality: the two-dimensional structure of images forces the LLM to predict visual tokens row by row, which fundamentally conflicts with the coherent semantic prediction of one-dimensional text.

This dual split, structural and distributional, exposes a major weakness of current autoregressive visual generation: the lack of a visual tokenizer that both guarantees high-fidelity image reconstruction and is unified with pretrained LLM vocabularies in structure and feature distribution. Solving this is essential for effective multimodal autoregressive modeling and stronger instruction following.

The core question is therefore: can we design a visual tokenizer whose discrete visual tokens fuse seamlessly with a pretrained LLM's vocabulary while preserving high-quality visual reconstruction?

Unifying visual tokens with the LLM vocabulary, newly open-sourced ...
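The vector-quantization step at the heart of such a tokenizer maps each continuous encoder feature to its nearest entry in a learned codebook, turning an image into a sequence of discrete token ids. A minimal sketch under the simplifying assumption that the codebook is a plain embedding table (in a vocabulary-aligned design like V²Flow this table would be tied to the LLM's token embeddings; that tying is not shown here):

```python
import numpy as np

def quantize(features, codebook):
    """Assign each feature vector to its nearest codebook entry.

    features: (n, d) continuous visual features from an encoder.
    codebook: (k, d) embedding table playing the role of a vocabulary.
    Returns (token_ids, quantized_vectors).
    """
    # Squared Euclidean distance from every feature to every code,
    # via broadcasting: result has shape (n, k).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d2.argmin(axis=1)          # discrete token id per feature
    return ids, codebook[ids]        # id sequence + reconstruction input

# Toy 3-entry codebook in 2-D feature space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
ids, q = quantize(np.array([[0.1, -0.1], [0.9, 1.2]]), codebook)
# ids -> [0, 1]
```

The resulting id sequence is what the autoregressive LLM predicts; the quantized vectors feed the decoder that reconstructs the image.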