ETT: Breaking the Visual Bottleneck of Native Multimodal Learning and Reshaping the Vision Tokenizer Optimization Paradigm
机器之心 · 2025-05-27 06:38
Core Viewpoint
- The article introduces ETT (End-to-End Vision Tokenizer Tuning), a method that jointly optimizes visual tokenization and downstream tasks, addressing a key limitation of traditional visual tokenization pipelines [2][4].

Group 1: Limitations of Traditional Methods
- Traditional visual tokenization decouples the optimization of the visual tokenizer from downstream task training, which yields suboptimal performance on tasks that require rich semantic representations [1][5].
- Existing multimodal pre-training frameworks such as Emu3 rely on frozen visual tokenizers, which wastes the tokenizer's rich feature representations and blocks end-to-end training [6][10].

Group 2: ETT Innovations
- ETT couples visual tokenization with the target autoregressive task for joint optimization, so the visual tokenizer adapts to feedback from downstream tasks [4][10] (sketches of a quantizer and one plausible joint objective follow this summary).
- The architecture builds on an improved IBQ framework, with a 131,072-entry codebook and a feature dimension of 256, improving tokenizer efficiency [10].

Group 3: Training Strategy
- ETT uses a staged training strategy: in the alignment learning phase, only the visual projection layer is trained while the large language model and visual tokenizer parameters stay frozen [11].
- In the semantic learning phase, all model weights are unfrozen for end-to-end training, letting the visual tokenizer strengthen its perceptual capability while retaining its image reconstruction ability [11] (see the freeze/unfreeze sketch after this summary).

Group 4: Performance Metrics
- In multimodal understanding, ETT achieves competitive results on benchmarks such as GQA and MMBench despite using fewer model parameters and less training data than state-of-the-art vision-language models [12][13].
- In multimodal generation, ETT matches the performance of advanced diffusion and autoregressive models while being more efficient in model parameters and training data [14][15].

Group 5: Qualitative Results
- ETT generates diverse, detailed visual content that follows text prompts closely, producing high-quality images across a range of artistic styles and themes [16].

Group 6: Visual Reconstruction
- ETT markedly improves visual reconstruction, preserving low-level detail while strengthening high-level semantic representation, thus providing better visual representations for multimodal tasks [17].

Group 7: Future Directions
- Future work will scale up ETT's data and model capacity, explore end-to-end training of visual tokenizers from scratch, and extend the approach to other modalities such as video and audio [19].

Group 8: Conclusion
- ETT marks a step forward for native multimodal learning: a simple yet effective way to optimize visual tokenizers that improves multimodal model performance and opens the way to broader applications [25].
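The summary reports an improved IBQ-based tokenizer with a 131,072-entry codebook and 256-dimensional features (Group 2). The sketch below is a minimal, hypothetical vector quantizer in PyTorch with that configuration; the class and loss weights are illustrative assumptions, not the authors' IBQ code. The straight-through estimator in the forward pass is what lets downstream gradients reach the tokenizer at all.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal quantizer with the codebook shape reported for ETT
    (131,072 entries x 256 dims). Hypothetical sketch, not IBQ itself."""

    def __init__(self, num_codes: int = 131_072, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # z: (batch, tokens, dim) continuous features from the tokenizer encoder.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (b, t, codes)
        idx = dists.argmin(dim=-1)            # discrete visual token ids
        z_q = self.codebook(idx)              # quantized embeddings
        # VQ-VAE-style losses: pull the codebook toward the encoder output
        # and commit the encoder to its chosen codes.
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: forward uses z_q, backward copies the
        # gradient onto z, so task losses can tune the whole tokenizer.
        z_q = z + (z_q - z).detach()
        return z_q, idx, vq_loss

# Quick check with fake features for a 2-image, 16-token batch:
vq = VectorQuantizer()
z_q, idx, vq_loss = vq(torch.randn(2, 16, 256))
```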
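ETT's core claim (Group 2) is that the downstream autoregressive loss should reach back into the tokenizer instead of leaving it frozen. Below is one plausible shape of such a joint objective, a hypothetical single training step: a next-token loss on the text plus a weighted reconstruction/VQ loss on the tokenizer, backpropagated together. The handles `enc`, `dec`, `proj`, `llm` and the weight `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_step(enc, dec, vq, proj, llm, optimizer, images, text_ids, lam=1.0):
    """One hypothetical ETT-style step: task loss and tokenizer losses are
    optimized jointly, so downstream feedback also updates the tokenizer."""
    z = enc(images)                             # continuous visual features
    z_q, _, vq_loss = vq(z)                     # discretize (straight-through)
    recon_loss = F.mse_loss(dec(z_q), images)   # preserve reconstruction ability

    vis = proj(z_q)                             # map visual tokens into LLM space
    logits = llm(vis, text_ids)                 # (batch, text_len, vocab)
    ar_loss = F.cross_entropy(                  # next-token prediction on text
        logits[:, :-1].reshape(-1, logits.size(-1)),
        text_ids[:, 1:].reshape(-1),
    )

    loss = ar_loss + lam * (recon_loss + vq_loss)
    optimizer.zero_grad()
    loss.backward()                             # gradients flow into enc/vq/dec
    optimizer.step()
    return loss.detach()
```

Freezing the tokenizer would amount to dropping the `enc`/`vq`/`dec` parameters from the optimizer; the decoupled-training problem in Group 1 is exactly that regime.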
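The two phases in Group 3 amount to toggling which parameter groups receive gradients. A minimal sketch, reusing the hypothetical module handles from the step above:

```python
def set_trainable(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

# Phase 1, alignment learning: train only the visual projection layer;
# the LLM and the visual tokenizer stay frozen.
for m in (enc, dec, vq, llm):
    set_trainable(m, False)
set_trainable(proj, True)

# Phase 2, semantic learning: unfreeze everything for end-to-end training;
# the reconstruction term in the joint objective above preserves the
# tokenizer's decoding ability while it gains semantic capability.
for m in (enc, dec, vq, llm, proj):
    set_trainable(m, True)
```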