Long Context
Zhipu's luck fell just short: its visual-token research has collided with DeepSeek again
量子位· 2025-10-22 15:27
Core Viewpoint
- The article discusses the competition between Zhipu and DeepSeek in the AI field, focusing on the release of Zhipu's visual-token solution, Glyph, which aims to address the long-context challenge in large language models (LLMs) [1][2][6].

Group 1: Context Expansion Challenges
- The demand for long context in LLMs is rising, driven by applications such as document analysis and multi-turn dialogue [8].
- Expanding context length sharply increases computational cost: because full self-attention scales quadratically with sequence length, doubling the context from 50K to 100K tokens roughly quadruples the compute (a back-of-envelope sketch follows this summary) [9][10].
- Simply adding more tokens does not guarantee better model performance, as excessive input can introduce noise and information overload [12][14].

Group 2: Existing Solutions
- Three mainstream approaches to the long-context problem are identified:
1. **Extended Position Encoding**: extends the existing position-encoding range to accommodate longer inputs without retraining the model [15][16].
2. **Attention Mechanism Modification**: techniques such as sparse and linear attention improve per-token processing efficiency but do not reduce the total token count [20][21].
3. **Retrieval-Augmented Generation (RAG)**: shortens inputs via external retrieval but can slow down end-to-end response time [22][23].

Group 3: Glyph Framework
- Glyph proposes a new paradigm: rendering long texts into images, which yields higher information density and lets visual language models (VLMs) process them efficiently [25][26].
- Using visual tokens, Glyph sharply reduces the number of tokens needed; for example, the entire text of "Jane Eyre" can be represented with roughly 80K visual tokens instead of 240K text tokens [32][36].
- Glyph is trained in three stages: continual pre-training, LLM-driven rendering search, and post-training, which together strengthen the model's ability to interpret visual information [37][44].

Group 4: Performance and Results
- Glyph achieves a token compression rate of 3-4x while maintaining accuracy comparable to mainstream models [49].
- Glyph delivers roughly 4x faster prefill and decoding, and about 2x faster supervised fine-tuning (SFT) [51].
- Glyph also performs well on multimodal tasks, indicating strong generalization [53].

Group 5: Contributors and Future Implications
- The paper's first author is Jiale Cheng, a PhD student at Tsinghua University, with contributions from Yusen Liu, Xinyu Zhang, and Yulin Fei [57][62].
- The article suggests that visual tokens may redefine how LLMs process information, potentially making pixels, rather than text, the fundamental unit of AI input [76][78].
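A minimal back-of-envelope sketch in Python may make two of the numbers above concrete: the quadratic growth of full self-attention compute with context length, and the token arithmetic behind the "Jane Eyre" example. The hidden size of 4096 is a placeholder assumption, and the FLOPs formula is the generic O(n²·d) attention estimate rather than Glyph's measured cost.

```python
def attention_flops(context_tokens: int, hidden_size: int = 4096) -> float:
    """Rough FLOPs for one full self-attention pass: scales as n^2 * d."""
    return 2 * context_tokens ** 2 * hidden_size

# Doubling the context from 50K to 100K tokens roughly quadruples attention compute.
base = attention_flops(50_000)
doubled = attention_flops(100_000)
print(f"100K vs 50K context: {doubled / base:.1f}x compute")  # -> 4.0x

# The article's "Jane Eyre" example: ~240K text tokens vs ~80K visual tokens.
text_tokens, visual_tokens = 240_000, 80_000
print(f"Token compression: {text_tokens / visual_tokens:.0f}x fewer tokens")
print(f"Attention compute saving: "
      f"{attention_flops(text_tokens) / attention_flops(visual_tokens):.0f}x")
```

Because the cost is quadratic in sequence length, a 3x token reduction translates into roughly a 9x saving in attention compute under this simplified model, which is consistent with the prefill and decoding speedups the article reports.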
DeepSeek-V3.1 version update: dual modes open for public trial
Feng Huang Wang· 2025-09-23 07:29
Core Insights
- The new DeepSeek-V3.1-Terminus version has been launched, offering both "Thinking Mode" and "Non-Thinking Mode" with support for a 128K long context [1].

Group 1: Model Upgrades
- The deepseek-chat and deepseek-reasoner models have been unified and upgraded to DeepSeek-V3.1-Terminus, with deepseek-chat corresponding to Non-Thinking Mode and deepseek-reasoner to Thinking Mode [1].
- Key optimizations include improved language consistency, markedly reducing mixed Chinese-English output and abnormal characters and yielding more standardized responses [1].
- Agent capabilities have been further enhanced, particularly the execution performance of the Code Agent and Search Agent [1].

Group 2: Output Length and Pricing
- For output length, Non-Thinking Mode defaults to 4K tokens with a maximum of 8K, while Thinking Mode defaults to 32K and can be expanded to 64K, covering different generation-length requirements [1].
- Pricing is 0.5 yuan per million input tokens on cache hits, 4 yuan per million input tokens on cache misses, and 12 yuan per million output tokens, giving developers a cost-effective large-model service (a cost sketch follows this summary) [1].
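The pricing bullets above lend themselves to a quick cost estimate. The sketch below only encodes the quoted prices (yuan per million tokens); the request size and cache-hit split are hypothetical examples, not DeepSeek figures.

```python
# Prices quoted in the summary above (yuan per million tokens).
PRICE_INPUT_CACHE_HIT = 0.5
PRICE_INPUT_CACHE_MISS = 4.0
PRICE_OUTPUT = 12.0

def request_cost_yuan(hit_tokens: int, miss_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one API call from its token counts."""
    return (hit_tokens * PRICE_INPUT_CACHE_HIT
            + miss_tokens * PRICE_INPUT_CACHE_MISS
            + output_tokens * PRICE_OUTPUT) / 1_000_000

# Hypothetical example: a 100K-token prompt with 60% served from cache,
# answered with a 32K-token response (the default Thinking Mode output length).
print(f"{request_cost_yuan(60_000, 40_000, 32_000):.3f} yuan")  # ~0.574 yuan
```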
MiniMax open-sources its flagship M1 model: million-token context surpasses DeepSeek R1, winning on both performance and efficiency
AI科技大本营· 2025-06-17 02:32
Core Insights
- MiniMax has officially open-sourced its latest large language model, MiniMax-M1, marking a significant development in the AI landscape [2][4].
- MiniMax-M1 is described as the world's first open-weight large-scale hybrid-attention reasoning model, with substantial gains in both performance and inference efficiency [4][6].

Model Specifications
- MiniMax-M1 has 456 billion parameters, with roughly 45.9 billion activated per token, and supports a maximum context length of 1 million tokens, about 8x that of DeepSeek R1 [7][12].
- Its computational load (FLOPs) for generating 100K tokens is only 25% of what DeepSeek R1 requires, a significant advantage in long-text processing tasks (see the arithmetic sketch after this summary) [7][12].

Training and Efficiency
- MiniMax-M1 was trained with a large-scale reinforcement learning (RL) strategy, optimizing performance across tasks including mathematical reasoning and software engineering [9][11].
- The full RL training run took three weeks on 512 H800 GPUs at a cost of approximately $534,700, demonstrating high efficiency and cost-effectiveness [11].

Performance Comparison
- MiniMax-M1 is available in two versions with maximum generation lengths of 40K and 80K tokens, and outperforms leading open-weight models such as DeepSeek-R1 and Qwen3-235B on complex software engineering, tool-use, and long-context tasks [12][19].
- In benchmark tests, MiniMax-M1 leads in several categories, including long-context understanding and tool use, establishing itself as a strong contender in the AI model landscape [19].
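As a quick sanity check on the figures quoted above, the short Python sketch below simply restates the article's numbers; the 128K DeepSeek R1 context is inferred from the "8x" comparison rather than stated directly, and nothing here is independently measured.

```python
# Figures quoted in the summary above.
total_params = 456e9      # MiniMax-M1 total parameters
active_params = 45.9e9    # parameters activated per token
print(f"Activated per token: {active_params / total_params:.1%}")  # ~10.1%

m1_context = 1_000_000    # MiniMax-M1 maximum context length (tokens)
r1_context = 128_000      # DeepSeek R1 context implied by the 8x comparison
print(f"Context ratio vs DeepSeek R1: {m1_context / r1_context:.1f}x")  # ~7.8x

# The article attributes the ~75% FLOPs reduction when generating 100K tokens
# to the hybrid attention design rather than to activated parameter count alone.
relative_decode_flops = 0.25
print(f"Decode FLOPs at 100K tokens vs DeepSeek R1: {relative_decode_flops:.0%}")
```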