DeepSeek's New Model Has Silicon Valley Raving! Compressing 1D Text with 2D Vision, Runs on a Single GPU, "Google's Core Secret Has Been Open-Sourced"
量子位 (QbitAI) · 2025-10-20 23:34

Core Insights
- DeepSeek has released a groundbreaking open-source model, DeepSeek-OCR, which is drawing significant attention in Silicon Valley for its efficient, innovative approach to processing long texts [1][3][7].

Model Overview
- DeepSeek-OCR tackles the computational cost large models face with long texts by compressing textual information into visual tokens, sharply reducing the number of tokens that must be processed [5][12][13].
- The model achieves a decoding accuracy of 97% at compression ratios below 10x, and still reaches around 60% even at 20x compression [6].

Performance Metrics
- DeepSeek-OCR achieves state-of-the-art (SOTA) results on the OmniDocBench benchmark while using significantly fewer visual tokens than existing models [14][15].
- For instance, with only 100 visual tokens, DeepSeek-OCR outperforms GOT-OCR2.0, which uses 256 tokens, and matches the performance of other models while using far fewer tokens [17].

Technical Components
- The architecture consists of two main components: the DeepEncoder, which converts high-resolution images into highly compressed visual tokens, and the DeepSeek3B-MoE-A570M decoder, which reconstructs text from those tokens [20][22].
- The model supports multiple input modes, allowing it to adapt its compression strength to the requirements of a given task [24].

Innovative Concepts
- The research introduces "Contextual Optical Compression," which mimics human memory by dynamically allocating computational resources according to how recent the processed information is [36][38].
- This approach aims to improve the handling of long conversations and documents, potentially giving AI systems a more human-like memory structure [39][41].
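The compression figures above can be made concrete with a little bookkeeping. The sketch below is purely illustrative (the function name and token counts are invented, not from DeepSeek's code): it computes the ratio of original text tokens to the visual tokens that replace them, the quantity the article's 10x/20x accuracy numbers refer to.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ratio of original text tokens to the visual tokens replacing them.

    Per the article: decoding accuracy is ~97% when this ratio is under
    10x, and still ~60% at around 20x.
    """
    return text_tokens / vision_tokens


# A 1,000-token passage rendered into 100 visual tokens is 10x compression,
# i.e. within the regime where the article reports ~97% decoding accuracy.
print(compression_ratio(1000, 100))  # → 10.0
print(compression_ratio(2000, 100))  # → 20.0
```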
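The two-stage architecture described above can be sketched as a pipeline. This is a hypothetical stand-in, not the real DeepSeek-OCR API: the class names mirror the components the article mentions (DeepEncoder, DeepSeek3B-MoE-A570M decoder), but the mode names and token budgets are invented to illustrate how multiple input modes could trade compression strength against token count.

```python
from dataclasses import dataclass


@dataclass
class VisualTokens:
    """Compressed representation of a rendered page (placeholder)."""
    tokens: list


class DeepEncoder:
    """Hypothetical stand-in for the encoder that turns a high-resolution
    page image into a small number of visual tokens."""

    # Invented budgets: stronger compression -> fewer visual tokens.
    MODE_BUDGETS = {"tiny": 64, "small": 100, "base": 256}

    def encode(self, page_image, mode: str = "base") -> VisualTokens:
        budget = self.MODE_BUDGETS[mode]
        # Real model would run a vision backbone here; we just emit IDs.
        return VisualTokens(tokens=list(range(budget)))


class MoEDecoder:
    """Hypothetical stand-in for the DeepSeek3B-MoE-A570M decoder that
    reconstructs text from the compressed visual tokens."""

    def decode(self, vt: VisualTokens) -> str:
        return f"<text reconstructed from {len(vt.tokens)} visual tokens>"


# End-to-end: page image -> 100 visual tokens -> reconstructed text.
vt = DeepEncoder().encode(page_image=None, mode="small")
print(MoEDecoder().decode(vt))  # → <text reconstructed from 100 visual tokens>
```

The key design point the article highlights is that the decoder only ever sees the compressed token stream, so its sequence length (and hence attention cost) is set by the encoder's budget, not by the original document length.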
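The "Contextual Optical Compression" idea, allocating resources by how recent information is, can be sketched as a decaying token budget. Everything here is an invented illustration (the paper specifies no such formula): older conversation turns get progressively fewer visual tokens, mimicking fading human memory.

```python
def token_budget(age: int, base: int = 256, floor: int = 16) -> int:
    """Halve the visual-token budget for each step back in time,
    never dropping below a small floor. `age` 0 is the newest turn.

    Purely illustrative: base/floor values and the halving schedule
    are assumptions, not DeepSeek's actual allocation policy.
    """
    return max(floor, base >> age)


# Recent context keeps full fidelity; distant context stays cheap.
print([token_budget(a) for a in range(6)])  # → [256, 128, 64, 32, 16, 16]
```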