Beyond Compression: Should the Visual Tokenizer Also Understand the World?
机器之心 · 2025-12-28 01:30
机器之心PRO · Member Newsletter, Week 52 --- This week we analyze 2 noteworthy AI & Robotics industry topics --- 1. Beyond compression, should the Visual Tokenizer also understand the world? Is understanding the world the key to the Visual Tokenizer's next evolution? Are tokenizers that use 1D sequences better suited to large-scale training than 2D grid sequences? Could today's discrete tokenizers be merely a transitional solution? Does distribution shift at the generation-sampling stage cause the widespread "strong reconstruction, weak generation" phenomenon? How can discrete tokenizers surpass the reconstruction quality of continuous latent spaces without sacrificing compression efficiency? ... 2. In-depth interview with Demis Hassabis: Why does AGI require a return to the "AlphaGo mode"? What is "jagged intelligence"? Why does AGI need to return to the "AlphaGo" mode? How do SIMA and Genie work together with "curiosity" to create unlimited training resources? How can "physics benchmarks" eliminate hallucinations in simulated worlds? How does the "root-node problem" trigger chain reactions? How will AGI drive economic restructuring? ... The full edition of this newsletter contains 2 in-depth topic analyses + 24 AI & Robotics news briefs from this week, of which 9 are technical ...
NeurIPS 2025 | VFMTok: The Era of Visual Foundation Model-Driven Tokenizers Arrives
机器之心 · 2025-10-28 09:37
Core Insights
- The article discusses the potential of using frozen Visual Foundation Models (VFMs) as effective visual tokenizers for autoregressive image generation, highlighting their ability to enhance image reconstruction and generation tasks [3][11][31].

Group 1: Traditional Visual Tokenizers
- Traditional visual tokenizers such as VQGAN must be trained from scratch, yielding a latent space that lacks high-level semantic information and carries high redundancy [4][7].
- The latent space of traditional models is poorly organized, resulting in longer training times and a reliance on additional techniques such as Classifier-Free Guidance (CFG) for high-fidelity image generation [7][12].

Group 2: Visual Foundation Models (VFMs)
- Pre-trained VFMs such as CLIP, DINOv2, and SigLIP2 excel at extracting rich, generalizable semantic visual features and are primarily used for image-understanding tasks [4][11].
- The research team's hypothesis is that the latent features of these VFMs can also serve image reconstruction and generation tasks [4][10].

Group 3: VFMTok Architecture
- VFMTok builds high-quality visual tokenizers on top of frozen VFMs, using multi-level feature extraction to capture both low-level details and high-level semantics [14][17].
- The architecture includes a region-adaptive quantization mechanism that improves token efficiency by focusing on consistent patterns within the image [18][19].

Group 4: Experimental Findings
- VFMTok outperforms traditional tokenizers in image reconstruction and autoregressive generation, achieving better reconstruction quality with fewer tokens (256) [23][28].
- Autoregressive models converge significantly faster during training with VFMTok, outperforming classic models such as VQGAN [24][26].
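The pipeline described in Groups 1-3 — features from a frozen encoder quantized against a discrete codebook — can be sketched with a toy numpy example. Everything here (the `vfm_encode` stand-in for a frozen VFM, the random codebook, the shapes) is a hypothetical illustration of discrete tokenization in general, not VFMTok's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def vfm_encode(image: np.ndarray, dim: int = 16) -> np.ndarray:
    """Stand-in for a frozen VFM (e.g. DINOv2): image -> per-patch features.
    Here: a fixed random projection of 4x4 patches; the projection weights
    are seeded once, i.e. 'frozen'."""
    h, w = image.shape
    patches = image.reshape(h // 4, 4, w // 4, 4).transpose(0, 2, 1, 3)
    patches = patches.reshape(-1, 16)                        # one row per patch
    proj = np.random.default_rng(42).normal(size=(16, dim))  # frozen weights
    return patches @ proj

def quantize(features: np.ndarray, codebook: np.ndarray):
    """Nearest-neighbour lookup: map each feature to its closest code."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)          # discrete token ids
    return ids, codebook[ids]       # ids plus the quantized features

image = rng.normal(size=(16, 16))
codebook = rng.normal(size=(8, 16))         # 8 discrete codes, 16-dim each
feats = vfm_encode(image)
ids, quantized = quantize(feats, codebook)
print(ids.shape, quantized.shape)           # (16,) (16, 16)
```

A decoder trained to reconstruct the image from `quantized` would complete the tokenizer; the article's point is that starting from frozen, semantically rich VFM features gives this latent space far better structure than training the encoder from scratch.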
Group 5: CFG-Free Performance
- VFMTok performs consistently with or without CFG, indicating strong semantic consistency in its latent space and enabling high-fidelity class-to-image generation without additional guidance [33].
- The reduced token count yields approximately four times faster inference during generation [33].

Group 6: Future Outlook
- The findings suggest that leveraging the prior knowledge in VFMs is crucial for constructing high-quality latent spaces and developing the next generation of tokenizers [32].
- A unified tokenizer that is semantically rich and efficient across various generative models is highlighted as a future research direction [32].
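Classifier-free guidance, which Group 5 reports VFMTok can do without, mixes a conditional and an unconditional prediction at sampling time. A minimal numpy sketch of the guidance rule, with illustrative values not taken from the paper:

```python
import numpy as np

def cfg_logits(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    """Classifier-free guidance on next-token logits:
    uncond + scale * (cond - uncond)."""
    return uncond + scale * (cond - uncond)

cond = np.array([2.0, 0.5, -1.0])    # logits given the class condition
uncond = np.array([1.0, 1.0, 1.0])   # logits with the condition dropped

print(cfg_logits(cond, uncond, 1.0))  # scale 1 -> identical to cond
print(cfg_logits(cond, uncond, 3.0))  # scale 3 -> pushed toward the condition
```

At scale 1 the guided logits reduce to the plain conditional prediction, so "CFG-free" generation amounts to sampling directly from `cond` — which also halves the forward passes per step, contributing (alongside the smaller token count) to the reported inference speedup.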