Core Viewpoint
- The article covers a breakthrough by Huawei's Pangu multimodal generation team in converting visual data into discrete tokens, aiming to replicate the success of large language models (LLMs) in the visual domain through a new approach called Selftok [1][5].

Group 1: Selftok Breakthrough
- Selftok builds autoregressive (AR) principles into visual tokenization, converting pixel streams into discrete sequences that respect causal ordering [1][3].
- The initial Selftok paper was selected as a Best Paper Candidate at CVPR 2025, underscoring its significance in the field [3].

Group 2: Industry Consensus and Challenges
- The prevailing view in the industry is that LLMs are hitting a language-data bottleneck, while non-language data such as images and video still holds large untapped potential [5].
- A unified multimodal architecture is seen as the key to stronger emergent capabilities, with the central challenge being how to convert continuous visual signals into discrete tokens [5].

Group 3: Advantages of Discrete Visual Tokens
- The article argues for abandoning spatial priors in favor of discrete visual tokens, which preserve accuracy while avoiding the pitfalls of continuous representations [6].
- Continuous representations are criticized for unstable prediction, added complexity in reinforcement learning, and limited decoupling (disentanglement) capability; discrete tokens instead let visual data reuse the same next-token objective as text (a minimal sketch follows this summary) [6].

Group 4: Selftok Architecture
- Selftok's architecture consists of an encoder, quantizer, and decoder, using a dual-stream design to improve computational efficiency while maintaining reconstruction quality (see the pipeline sketch below) [18][20].
- The quantizer uses a dedicated mechanism to counter the training imbalances of conventional quantization, unifying the diffusion process with autoregressive modeling [20].

Group 5: Training and Optimization
- Selftok's pre-training aligns multimodal inputs to turn an LLM into a visual-language model (VLM) [24].
- The model is then optimized with reinforcement learning (RL), using two types of reward functions that score the generated images' attributes and spatial relationships (see the reward sketch below) [25][27].

Group 6: Experimental Results
- Selftok achieves state-of-the-art (SoTA) results on ImageNet reconstruction metrics, preserving detail better than other tokenizers [28].
- In benchmark evaluations, Selftok-Zero clearly outperforms models such as GPT-4o, scoring 92 on the GenEval benchmark [29][30].
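To make the argument in Group 3 concrete, here is a minimal, hypothetical sketch of why discreteness matters for AR modeling: once images are token ids drawn from a finite codebook, the standard next-token cross-entropy used for text applies to visual tokens unchanged. The vocabulary sizes and shapes below are illustrative, not Selftok's.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only: a visual codebook appended to a text vocabulary.
text_vocab, visual_codebook = 32000, 8192
vocab_size = text_vocab + visual_codebook

# Model outputs for a mixed text+image token sequence (batch=2, length=512).
logits = torch.randn(2, 512, vocab_size)
# Shifted ground-truth ids; visual ids would live in [text_vocab, vocab_size).
targets = torch.randint(0, vocab_size, (2, 512))

# The same next-token cross-entropy used for language now covers pixels too.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(f"unified AR loss: {loss.item():.3f}")
```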
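The encoder–quantizer–decoder pipeline mentioned in Group 4 can be pictured with the sketch below: a generic learnable-query tokenizer with vector quantization that emits a 1-D ordered token sequence rather than a spatial grid. The class names, the cross-attention stand-in for the dual-stream design, and all hyperparameters are assumptions for illustration, not Selftok's published architecture.

```python
import torch
import torch.nn as nn

class Quantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through gradient."""
    def __init__(self, codebook_size: int = 8192, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):                      # z: (B, T, dim)
        flat = z.reshape(-1, z.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)        # (B*T, K)
        ids = dist.argmin(-1).view(z.shape[:-1])              # discrete token ids (B, T)
        z_q = self.codebook(ids)
        z_q = z + (z_q - z).detach()                          # gradients flow to the encoder
        return z_q, ids

class ImageTokenizer(nn.Module):
    """Encoder -> quantizer -> decoder, emitting an ordered discrete sequence."""
    def __init__(self, num_tokens: int = 512, dim: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        # One stream of learnable tokens reads another stream of patch features.
        self.read = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.quantizer = Quantizer(dim=dim)
        self.decode = nn.Linear(dim, 16 * 16 * 3)             # toy decoder head (not exercised here)

    def encode(self, images: torch.Tensor):                   # images: (B, 3, H, W)
        patches = self.patchify(images).flatten(2).transpose(1, 2)   # (B, HW/256, dim)
        q = self.queries.expand(images.size(0), -1, -1)
        z, _ = self.read(q, patches, patches)
        _, ids = self.quantizer(z)
        return ids                                             # the discrete visual tokens

tok = ImageTokenizer()
ids = tok.encode(torch.randn(2, 3, 256, 256))
print(ids.shape)                                               # torch.Size([2, 512])
```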
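Group 5 mentions two kinds of rewards, one for attributes and one for spatial relations. The sketch below shows one plausible way such signals could be combined into a single scalar for a policy-gradient update; the scoring functions, data structures, and weights are placeholders introduced for illustration, not the rewards actually used for Selftok-Zero.

```python
from dataclasses import dataclass, field

@dataclass
class GeneratedImage:
    """Parsed description of a generated image (hypothetical structure)."""
    attributes: dict                               # e.g. {"cube_color": "red"}
    relations: set = field(default_factory=set)    # e.g. {("cube", "left_of", "ball")}

def attribute_reward(pred: GeneratedImage, target: dict) -> float:
    """Fraction of prompt-specified attributes the image got right."""
    if not target:
        return 1.0
    hits = sum(pred.attributes.get(k) == v for k, v in target.items())
    return hits / len(target)

def relation_reward(pred: GeneratedImage, target: set) -> float:
    """Fraction of prompt-specified spatial relations the image satisfies."""
    if not target:
        return 1.0
    return len(pred.relations & target) / len(target)

def total_reward(pred, target_attrs, target_rels, w_attr=0.5, w_rel=0.5) -> float:
    # A weighted sum of the two signals is what the RL update would maximize
    # over the generated visual-token sequence.
    return w_attr * attribute_reward(pred, target_attrs) + \
           w_rel * relation_reward(pred, target_rels)

pred = GeneratedImage({"cube_color": "red", "ball_color": "blue"},
                      {("cube", "left_of", "ball")})
print(total_reward(pred,
                   {"cube_color": "red", "ball_color": "blue"},
                   {("cube", "left_of", "ball")}))             # 1.0
```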
The image tokenizer rebels! Huawei Selftok: an autoregressive core that perfectly unifies the diffusion model and triggers autonomous reasoning over pixels
机器之心·2025-05-17 06:00