Being-VL的视觉BPE路线：把「看」和「说」真正统一起来

Core Insights - The article discusses the limitations of traditional multimodal models, particularly how CLIP-style encoders prematurely align visual representations with text space, leading to potential hallucinations when detailed, non-language-dependent queries are made [2][6] - A new method called Being-VL is proposed, which emphasizes a post-alignment approach, allowing for the discrete representation of images before aligning them with text, thereby preserving visual structure and reducing the risk of information loss [2][3] Being-VL Implementation - Being-VL consists of three main steps: quantifying images into discrete VQ tokens using VQ-GAN, training a visual BPE that measures both co-occurrence frequency and spatial consistency, and finally unifying visual and text tokens into a single sequence for modeling [3][10] - The visual BPE tokenizer prioritizes both frequency and spatial consistency to create a more semantically and structurally meaningful token set, which is independent of text [8][9] Training Strategy - The training process is divided into three stages: 1. Embedding Alignment: Only the new visual token embeddings are trained while freezing other parameters to maintain existing language capabilities [12] 2. Selective Fine-tuning: A portion of the LLM layers is unfrozen to facilitate cross-modal interaction at lower representation levels [12] 3. Full Fine-tuning: All layers are unfrozen for comprehensive training on complex reasoning and instruction data [12][10] Experimental Results - Experiments indicate that the discrete representation of images followed by visual BPE and unified modeling with text leads to improved reliability in detail-sensitive queries and reduces hallucinations compared to traditional methods [14][16] - The study highlights the importance of a gradual training approach, showing that a combination of progressive unfreezing and curriculum learning significantly outperforms single-stage training methods [14][10] Visual BPE Token Activation - Visualization of embedding weights shows that using visual BPE leads to a more balanced distribution of weights between text and visual tokens, indicating reduced modality gaps and improved cross-modal attention [16][19] Token Size and Training Efficiency - The research explores the impact of BPE token size on training efficiency, finding an optimal balance in resource-limited scenarios, while larger token sizes may lead to diminishing returns due to sparsity [19][20] Development and Summary - The evolution from Being-VL-0 to Being-VL-0.5 reflects enhancements in the unified modeling framework, incorporating priority-guided encoding and a structured training approach [20][24]