Discrete Tokenization: A Key Cornerstone of Multimodal Large Models, and the First Systematic Survey Is Released
机器之心 · 2025-08-05 18:56
Core Insights
- The article covers recent advances in Discrete Tokenization for Multimodal Large Language Models (LLMs), emphasizing its role in converting diverse modalities into discrete representations that LLMs can process directly [2][39].
- A comprehensive survey has been released that maps the technical landscape, open challenges, and future research directions of Discrete Tokenization for Multimodal LLMs [2][39].

Multimodal LLMs and Discrete Tokenization
- Breakthroughs in Large Language Models have produced strong performance across text tasks, prompting interest in extending these models to non-text modalities such as images, audio, and video [2].
- Discrete Tokenization has emerged as a key solution: techniques such as Vector Quantization (VQ) compress high-dimensional continuous inputs into compact discrete tokens, improving cross-modal understanding and generation [2][39] (a minimal VQ sketch appears at the end of this summary).

Systematic Review and Methodologies
- The survey is the first systematic review of Discrete Tokenization for Multimodal LLMs, organizing the literature by input modality and modality combination, from early single-modal tokenizers through to multimodal ones [2][39].
- Eight core categories of vector-quantization methods are identified: plain VQ, residual VQ (RVQ), product quantization (PQ), additive quantization (AQ), finite scalar quantization (FSQ), lookup-free quantization (LFQ), binary spherical quantization (BSQ), and graph anchor-relation tokenization, each suited to different modalities and tasks [8][9][14] (RVQ and FSQ are sketched below).

Challenges and Future Directions
- Key challenges in Discrete Tokenization include codebook collapse, information loss during quantization, gradient propagation through the non-differentiable quantization step (see the straight-through sketch below), and issues of granularity and semantic alignment [12][36].
- Future research directions include adaptive quantization, unified frameworks, biologically inspired codebooks, cross-modal generalization, and improved interpretability [37][36].

Applications in Single-Modal and Multimodal Tasks
- Discrete Tokenization is widely applied in single-modal tasks such as image retrieval, audio encoding, and video representation, letting LLMs process non-text modalities effectively [20][22].
- In multimodal settings it serves as a semantic bridge, enabling models to handle complex inputs spanning different modalities and supporting tasks such as cross-modal retrieval and cross-modal generation [27][30].
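
To make the core operation concrete, here is a minimal sketch of plain vector quantization: each continuous feature vector is replaced by the index of its nearest codebook entry. PyTorch, the function name vq_tokenize, and the toy shapes are illustrative assumptions for this summary, not the survey's implementation.

```python
import torch

def vq_tokenize(features, codebook):
    """Map each continuous feature vector to the id of its nearest
    codebook entry (Euclidean distance).

    features: (N, D) tensor of continuous embeddings
    codebook: (K, D) tensor of K code vectors
    returns:  (N,) integer token ids and the (N, D) quantized vectors
    """
    dists = torch.cdist(features, codebook)  # (N, K) pairwise distances
    tokens = dists.argmin(dim=1)             # nearest code = discrete token id
    quantized = codebook[tokens]             # reconstruction from the codebook
    return tokens, quantized

# Toy usage: 4 feature vectors against an 8-entry, 16-dim codebook.
torch.manual_seed(0)
feats = torch.randn(4, 16)
codes = torch.randn(8, 16)
ids, quant = vq_tokenize(feats, codes)
print(ids.tolist())  # four small integers: the tokens an LLM could consume
```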
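
Of the eight families, residual quantization (RVQ) is the most direct extension of plain VQ: each stage quantizes the error left by the previous stage, so a single input becomes a short sequence of token ids. A minimal sketch under the same illustrative assumptions, reusing vq_tokenize from above:

```python
import torch

def rvq_tokenize(features, codebooks):
    """Residual VQ: quantize, subtract the reconstruction, then quantize
    the leftover error with the next codebook, so coarse content and
    fine detail land in separate tokens.

    codebooks: list of (K, D) tensors, one per stage
    returns:   (N, num_stages) token ids and the summed reconstruction
    """
    residual = features.clone()
    recon = torch.zeros_like(features)
    stage_ids = []
    for cb in codebooks:
        ids, quant = vq_tokenize(residual, cb)  # plain VQ step per stage
        stage_ids.append(ids)
        recon = recon + quant
        residual = residual - quant             # pass the error downstream
    return torch.stack(stage_ids, dim=1), recon
```

Each added stage only sees what earlier stages failed to capture, so reconstruction error shrinks as codebooks are added, a property that makes RVQ a common choice in neural audio codecs.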
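
FSQ, LFQ, and BSQ take a different route: instead of learning a codebook they fix the code structure up front, which sidesteps codebook collapse entirely. Below is a simplified FSQ sketch in which each latent dimension is bounded and rounded to a small fixed number of levels, making the implicit codebook the product grid of levels. The odd level counts, the names, and the token-id packing are simplifying assumptions of this sketch, not the exact published formulation.

```python
import torch

def fsq_tokenize(z, levels=(7, 7, 7)):
    """Finite Scalar Quantization: no learned codebook. Each latent
    dimension is squashed into a bounded range and rounded to a fixed
    set of integer levels; the implicit codebook is the product grid
    (here 7 * 7 * 7 = 343 codes), so there is no codebook to collapse.

    z: (N, len(levels)) tensor of latent values
    """
    L = torch.tensor(levels, dtype=z.dtype)
    half = (L - 1) / 2
    z_bounded = torch.tanh(z) * half               # each dim in (-half, half)
    z_q = torch.round(z_bounded)                   # snap to integer levels
    z_q = z_bounded + (z_q - z_bounded).detach()   # straight-through rounding
    # Pack the per-dimension levels into a single integer token id.
    digits = (z_q + half).long()                   # shift to 0..L-1 per dim
    bases = torch.cumprod(
        torch.cat([torch.ones(1, dtype=torch.long), L.long()[:-1]]), dim=0)
    token_ids = (digits * bases).sum(dim=1)
    return token_ids, z_q
```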
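
The gradient-propagation challenge arises because argmin and round have zero gradient almost everywhere. The common workaround in VQ-VAE-style tokenizers is the straight-through estimator (STE): run the discrete lookup in the forward pass, but copy gradients from the quantized output directly to the encoder output in the backward pass. A minimal sketch, again with illustrative names:

```python
import torch

def quantize_ste(z, codebook):
    """Nearest-code quantization with a straight-through estimator.
    Forward: discrete codebook lookup. Backward: the detached term
    contributes no gradient, so grads on z_q flow to z unchanged,
    bypassing the non-differentiable argmin."""
    ids = torch.cdist(z, codebook).argmin(dim=1)
    z_q = codebook[ids]
    z_q = z + (z_q - z).detach()   # identity in the backward pass
    return ids, z_q

# Toy check that gradients reach the encoder side despite the argmin.
z = torch.randn(4, 16, requires_grad=True)
codebook = torch.randn(8, 16)
ids, z_q = quantize_ste(z, codebook)
z_q.sum().backward()
print(z.grad.shape)  # torch.Size([4, 16]): gradient passed straight through
```

In full VQ-VAE training this trick is paired with codebook and commitment losses that actually move the code vectors; the sketch shows only the gradient path.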