Visual Tokens
Zhipu's luck falls just a bit short: its visual-token research has collided with DeepSeek's again
量子位 · 2025-10-22 15:27
Core Viewpoint - The article covers the rivalry between Zhipu and DeepSeek in the AI field, centering on Zhipu's newly released visual-token solution, Glyph, which tackles the long-context problem in large language models (LLMs) [1][2][6].

Group 1: Context Expansion Challenges
- Demand for long context in LLMs keeps growing, driven by applications such as document analysis and multi-turn dialogue [8].
- Extending the context length sharply raises computational cost: because self-attention scales quadratically with sequence length, going from 50K to 100K tokens roughly quadruples attention compute [9][10] (a back-of-envelope check follows this summary).
- Simply feeding in more tokens does not guarantee better performance; overly long inputs introduce noise interference and information overload [12][14].

Group 2: Existing Solutions
- Three mainstream approaches to the long-context problem are identified:
  1. **Extended Position Encoding**: stretch the existing position-encoding range to accommodate longer inputs without retraining the model [15][16] (see the position-interpolation sketch after this summary).
  2. **Attention Mechanism Modification**: techniques like sparse and linear attention improve per-token processing efficiency but do not reduce the total token count [20][21].
  3. **Retrieval-Augmented Generation (RAG)**: external retrieval shortens the input, though the retrieval step can slow overall response time [22][23] (a toy retrieval sketch follows).

Group 3: Glyph Framework
- Glyph proposes a new paradigm: render long text into images so that a vision-language model (VLM) can process it at much higher information density [25][26] (a minimal rendering sketch follows).
- Visual tokens sharply cut the token budget; for example, the full text of "Jane Eyre" can be represented with roughly 80K visual tokens versus 240K text tokens, a 3x reduction [32][36].
- Glyph is trained in three stages: continual pre-training, LLM-driven rendering search, and post-training, which together strengthen the model's ability to interpret rendered text [37][44].

Group 4: Performance and Results
- Glyph reaches a 3-4x token compression rate while keeping accuracy comparable to mainstream models [49].
- It delivers roughly 4x faster prefill and decoding, and about 2x faster supervised fine-tuning (SFT) [51].
- Glyph also performs strongly on multimodal tasks, indicating robust generalization [53].

Group 5: Contributors and Future Implications
- The paper's lead author is Jiale Cheng, a PhD student at Tsinghua University, with contributions from Yusen Liu, Xinyu Zhang, and Yulin Fei [57][62].
- The article suggests that visual tokens may redefine how LLMs process information, with pixels potentially replacing text as the fundamental unit of AI input [76][78].
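The quadratic-cost claim in Group 1 follows from self-attention computing a score between every pair of token positions, so compute grows with the square of the sequence length. A back-of-envelope check (constants chosen purely for illustration):

```python
# Self-attention compares every token with every other token,
# so the score matrix has L x L entries for a length-L sequence.
def attention_cost(seq_len: int) -> int:
    """Relative cost of the attention score matrix: L * L comparisons."""
    return seq_len * seq_len

ratio = attention_cost(100_000) / attention_cost(50_000)
print(ratio)  # 4.0 -- doubling 50K -> 100K tokens quadruples attention compute
```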
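One widely used member of the "extend the position encoding" family from Group 2 is position interpolation for RoPE: rescale position indices so a longer input fits inside the range the model was trained on, rather than extrapolating past it. The sketch below shows that idea only; it is not necessarily the specific method the article has in mind, and the RoPE base and lengths are illustrative:

```python
import math

def rope_angles(position: float, dim: int, base: float = 10000.0) -> list[float]:
    """Rotary-embedding angles for one position (standard RoPE formula)."""
    return [position / base ** (2 * i / dim) for i in range(dim // 2)]

def interpolated_position(position: int, trained_len: int, target_len: int) -> float:
    """Position interpolation: squeeze target-length indices into the
    trained range instead of extrapolating beyond it."""
    return position * trained_len / target_len

# A model trained on 4K context reads position 7000 of an 8K input
# as if it were position 3500, which lies inside the trained range.
pos = interpolated_position(7000, trained_len=4096, target_len=8192)
print(pos, rope_angles(pos, dim=8)[:2])
```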
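Group 2's third route, RAG, shortens the prompt by retrieving only the passages relevant to the query before calling the LLM. A toy illustration, with bag-of-words cosine similarity standing in for a real embedding model and a made-up corpus:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Keep only the k passages most similar to the query,
    so the prompt handed to the LLM stays short."""
    q = Counter(query.lower().split())
    ranked = sorted(corpus, key=lambda p: cosine(q, Counter(p.lower().split())), reverse=True)
    return ranked[:k]

corpus = [
    "Glyph renders long text into images for a vision-language model.",
    "The weather report predicts rain tomorrow.",
    "Visual tokens can pack more text into fewer model inputs.",
]
print(retrieve("how do visual tokens compress text", corpus, k=2))
```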
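Group 3's core move, rendering text onto an image page and letting a VLM read it, can be illustrated in a few lines of PIL. Everything here is an assumption for illustration, not Glyph's actual configuration: the page size, default font, the 28-pixel patch used to estimate visual tokens, and the 4-characters-per-token heuristic (Glyph's LLM-driven rendering search tunes such parameters automatically):

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_page(text: str, width: int = 896, height: int = 896) -> Image.Image:
    """Render plain text onto a white page: the basic move behind
    'compress long text by turning it into an image'."""
    page = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.load_default()  # assumed font; a real system would tune this
    draw.multiline_text((10, 10), textwrap.fill(text, width=100), fill="black", font=font)
    return page

def visual_token_estimate(img: Image.Image, patch: int = 28) -> int:
    """Assumed ViT-style tokenization: one token per patch x patch tile."""
    return (img.width // patch) * (img.height // patch)

text = "a long document paragraph repeated many times. " * 150  # stand-in input
page = render_page(text)  # a real system would paginate across multiple pages
print("text tokens (~4 chars/token):", len(text) // 4)  # ~1800
print("visual tokens for one page:", visual_token_estimate(page))  # 1024
```

The compression reported in Groups 3 and 4 comes from exactly this trade: a fixed page yields a fixed visual-token budget regardless of how much text is rendered onto it, so denser rendering buys a higher compression ratio at some cost in legibility.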