Cross-Modal Models

Being-VL's Visual BPE Route: Truly Unifying "Seeing" and "Speaking"
具身智能之心· 2025-10-11 00:02
Core Insights
- The article discusses the limitations of traditional multimodal models, in particular how CLIP-style encoders prematurely align visual representations to the text space, which can cause hallucinations when the model is asked about details that are only weakly tied to language [1][5]
- A new method called Being-VL is proposed, which builds a visual BPE (Byte Pair Encoding) to improve the alignment and joint modeling of visual and textual data [1][2]

Group 1: Being-VL Methodology
- Being-VL consists of three main steps: quantizing images into discrete VQ tokens with VQ-GAN, training a visual BPE whose merge rule weighs both co-occurrence frequency and spatial consistency, and unifying the resulting visual tokens with text tokens into a single sequence for modeling [2][5]
- The Priority-Guided Encoding scheme combines frequency and spatial consistency to produce a visual token set that is more meaningful both semantically and structurally; a hedged sketch of such a merge rule is given after this summary [7][8]

Group 2: Training Strategy
- Training is divided into three stages: initial alignment of the visual token embeddings, selective fine-tuning of the LLM, and full fine-tuning on complex reasoning and instruction data; a staged-unfreezing sketch also follows the summary [9][15]
- A curriculum learning strategy gradually moves from basic tasks to more complex ones, strengthening the model's grasp of cross-modal interactions [9][12]

Group 3: Experimental Results
- Experiments indicate that discretizing images and then applying visual BPE improves reliability on detail-sensitive tasks and reduces hallucinations compared with traditional methods [12][16]
- Introducing visual BPE markedly improves performance and robustness, showing that folding stable visual patterns into semantically meaningful tokens supports better reasoning [12][19]

Group 4: Tokenization and Efficiency
- The study examines how the size of the visual BPE vocabulary affects training efficiency, suggesting that a balanced vocabulary size best trades off expressiveness against training cost [19][20]
- Overly large vocabularies lead to sparse token distributions and diminishing returns on compute, pointing to careful scaling in future applications; a toy simulation of this trade-off closes the summary [19][20]
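To make the priority-guided merge idea concrete, here is a minimal Python sketch over 2D grids of VQ indices. It is not Being-VL's implementation: the priority score `alpha * frequency + beta * consistency`, the use of direction consistency as a proxy for spatial consistency, and the names `pair_priority` / `merge_step` are assumptions made for illustration; the article only states that frequency and spatial consistency are combined.

```python
from collections import Counter, defaultdict

import numpy as np


def adjacent_pairs(grid):
    """Yield (token_a, token_b, direction) for horizontally / vertically adjacent cells."""
    h, w = grid.shape
    for i in range(h):
        for j in range(w):
            if j + 1 < w:
                yield grid[i, j], grid[i, j + 1], "h"
            if i + 1 < h:
                yield grid[i, j], grid[i + 1, j], "v"


def pair_priority(grids, alpha=1.0, beta=1.0):
    """Score candidate merges by co-occurrence frequency plus a spatial-consistency proxy.

    The proxy used here (an assumption) is the fraction of a pair's occurrences that
    share its dominant direction: pairs that almost always appear in the same relative
    arrangement score higher.
    """
    freq = Counter()               # (a, b) -> total adjacent co-occurrences
    by_dir = defaultdict(Counter)  # (a, b) -> {"h": count, "v": count}
    for grid in grids:
        for a, b, d in adjacent_pairs(grid):
            freq[(a, b)] += 1
            by_dir[(a, b)][d] += 1

    total = sum(freq.values())
    scores = {
        pair: alpha * (count / total) + beta * (max(by_dir[pair].values()) / count)
        for pair, count in freq.items()
    }
    return scores, by_dir


def merge_step(grids, next_id, alpha=1.0, beta=1.0):
    """Merge the highest-priority adjacent pair into a new token id (one BPE step)."""
    scores, by_dir = pair_priority(grids, alpha, beta)
    a, b = max(scores, key=scores.get)
    direction = max(by_dir[(a, b)], key=by_dir[(a, b)].get)  # dominant arrangement
    di, dj = (0, 1) if direction == "h" else (1, 0)
    merged = []
    for grid in grids:
        g = grid.copy()
        h, w = g.shape
        for i in range(h - di):
            for j in range(w - dj):
                if g[i, j] == a and g[i + di, j + dj] == b:
                    g[i, j] = next_id        # merged token occupies the first cell
                    g[i + di, j + dj] = -1   # -1 marks the absorbed cell
        merged.append(g)
    return merged, (a, b), direction


# Toy usage: a few 8x8 grids of codebook indices stand in for VQ-GAN outputs.
rng = np.random.default_rng(0)
grids = [rng.integers(0, 16, size=(8, 8)) for _ in range(4)]
grids, pair, direction = merge_step(grids, next_id=16)
print("merged", pair, "along", direction, "into new token 16")
```

In a real pipeline the merge table would presumably be learned once over a large corpus of VQ grids and then frozen, so images can be tokenized deterministically at training and inference time.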
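The three-stage schedule can be pictured as progressively unfreezing parameter groups. The PyTorch sketch below is a guess at what that looks like: the attribute names (`embed_tokens`, `layers`), the choice to unfreeze the top four blocks in stage 2, and the task mixtures in `curriculum` are hypothetical; only the stage boundaries (embedding alignment, selective LLM fine-tuning, full fine-tuning with a curriculum) come from the article.

```python
import torch
from torch import nn


def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model, stage):
    """Freeze/unfreeze parameter groups for the three (assumed) training stages."""
    set_trainable(model, False)
    if stage == 1:
        # Stage 1: only align the newly added visual token embeddings.
        set_trainable(model.embed_tokens, True)
    elif stage == 2:
        # Stage 2: selective fine-tuning -- embeddings plus the top few blocks (a guess).
        set_trainable(model.embed_tokens, True)
        for block in model.layers[-4:]:
            set_trainable(block, True)
    else:
        # Stage 3: full fine-tuning on complex reasoning / instruction data.
        set_trainable(model, True)


def curriculum(stage):
    """Toy curriculum: the task mixture shifts from captioning toward reasoning."""
    mixtures = {
        1: {"caption": 1.0},
        2: {"caption": 0.5, "vqa": 0.5},
        3: {"caption": 0.2, "vqa": 0.4, "reasoning_instructions": 0.4},
    }
    return mixtures[stage]


class TinyBackbone(nn.Module):
    """Stand-in for a unified LLM over interleaved text and visual BPE tokens."""
    def __init__(self, vocab_size=1024, dim=64, n_layers=6):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )


model = TinyBackbone()
for stage in (1, 2, 3):
    configure_stage(model, stage)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"stage {stage}: {trainable} trainable params, mixture {curriculum(stage)}")
```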
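To make the vocabulary-size trade-off tangible, the toy simulation below runs plain frequency-based BPE over Zipf-distributed 1-D token streams and reports how average stream length and the long tail of rarely used merged tokens change as the number of merges grows. It illustrates the diminishing-returns argument only; it is not a reproduction of the paper's measurements, and it drops the spatial term entirely.

```python
from collections import Counter

import numpy as np


def compress(seq, pair, new_id):
    """Greedy left-to-right replacement of one merged pair."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(seqs, n_merges, next_id):
    """Plain frequency-based BPE over 1-D token streams (no spatial term)."""
    merged_ids = []
    for _ in range(n_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        best, _ = pairs.most_common(1)[0]
        seqs = [compress(s, best, next_id) for s in seqs]
        merged_ids.append(next_id)
        next_id += 1
    return seqs, merged_ids


# Zipf-distributed streams stand in for flattened grids of VQ indices.
rng = np.random.default_rng(0)
base_vocab = 256
streams = [[int(min(t, base_vocab - 1)) for t in rng.zipf(1.3, 128)]
           for _ in range(100)]

for n_merges in (32, 128, 512):
    compressed, merged_ids = learn_merges([list(s) for s in streams], n_merges, base_vocab)
    usage = Counter(t for s in compressed for t in s)
    tail = sum(1 for m in merged_ids if usage[m] < 5)
    avg_len = float(np.mean([len(s) for s in compressed]))
    print(f"{n_merges:4d} merges: avg stream length {avg_len:6.1f}, "
          f"{tail}/{len(merged_ids)} merged tokens occur < 5 times in the output")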