DeepSeek Releases Another New Model, Taking "Small and Beautiful" to New Heights

Core Insights
- The article discusses why current LLMs struggle with long texts: computational complexity grows quadratically with sequence length [1]
- DeepSeek-OCR addresses this with "optical compression," converting text into images to reduce the number of tokens required for processing [5][34]
- The model compresses tokens by 7 to 20 times while maintaining high accuracy, showing its potential for efficient long-context processing [34][36]

Technical Overview
- At compression ratios of up to 10 times, DeepSeek-OCR achieves an accuracy of over 97% [4]
- The model uses a two-component architecture: DeepEncoder for image feature extraction and compression, and DeepSeek3B-MoE for reconstructing text from the compressed visual tokens [16][18]
- DeepEncoder combines SAM-base and CLIP-large models with a convolutional compressor that significantly reduces the number of tokens before they enter the global attention layer [10][11]

Performance Metrics
- OmniDocBench benchmark results indicate that a single A100-40G GPU can generate over 200,000 pages of LLM/VLM training data per day, while 20 nodes (160 A100 GPUs) can produce up to 33 million pages per day [7]
- DeepSeek-OCR outperforms existing models, needing only 100 visual tokens to exceed the performance of GOT-OCR2.0, which uses 256 tokens per page [15]

Data Utilization
- The DeepSeek team collected 30 million pages of multilingual PDF data covering around 100 languages, with a focus on Chinese and English [21]
- The data is divided into coarse and fine annotations, with high-quality data generated through various models to enhance recognition capabilities [22]

Application Potential
- DeepSeek-OCR not only recognizes text but also has deep parsing capabilities, making it suitable for STEM applications that require structured extraction from complex images [27]
- The model can extract structured data from financial reports, chemical structures, and geometric figures, and can generate dense captions for natural images [28]

Future Directions
- The team proposes using "optical compression" to simulate human memory decay, allowing efficient processing of long contexts by reducing the fidelity of older information [30][31]
- Future plans include systematic evaluations and work on pre-training methods to further validate the effectiveness of this approach [35]
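
The memory-decay idea under Future Directions can be pictured with a small sketch. This is not DeepSeek's implementation. It only assumes that older context chunks are re-rendered at lower fidelity and therefore receive fewer vision tokens. The function names and the parameters `base_tokens`, `halve_every`, and `min_tokens` are hypothetical, and the default values are arbitrary.

```python
def vision_token_budget(age: int, base_tokens: int = 256,
                        halve_every: int = 4, min_tokens: int = 16) -> int:
    """Vision tokens granted to a context chunk that is `age` turns old.

    Hypothetical schedule: the budget halves every `halve_every` turns
    and never drops below `min_tokens`.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    budget = base_tokens // (2 ** (age // halve_every))
    return max(budget, min_tokens)


def total_budget(num_chunks: int) -> int:
    """Total vision tokens for a history of `num_chunks` chunks (ages 0..n-1)."""
    return sum(vision_token_budget(age) for age in range(num_chunks))


# Twelve chunks under the decaying schedule vs. a uniform full-fidelity budget.
decayed = total_budget(12)
uniform = 12 * 256
print(decayed, uniform)
```

Under these assumed numbers the most recent chunks keep full fidelity, while older ones shrink toward the floor, so the total token count grows more slowly than the length of the history.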
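
The figures under Core Insights and Performance Metrics can be sanity-checked with a little arithmetic. The sketch below is illustrative only. The 1,000-token page is a hypothetical example, and the cost model assumes plain quadratic scaling with sequence length, as described in the first Core Insights point. The helper names are not from the article.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def relative_quadratic_cost(text_tokens: int, vision_tokens: int) -> float:
    """Cost after compression divided by cost before, assuming cost ~ n^2."""
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return (vision_tokens / text_tokens) ** 2


# Hypothetical page: 1,000 text tokens rendered into 100 vision tokens.
ratio = compression_ratio(1000, 100)
cost = relative_quadratic_cost(1000, 100)

# Throughput check: 160 GPUs at the stated floor of 200,000 pages per day each.
pages_per_day = 160 * 200_000

print(f"compression ratio: {ratio:.0f}x")
print(f"relative quadratic cost: {cost:.2%}")
print(f"cluster pages per day (lower bound): {pages_per_day:,}")
```

A 10x reduction in tokens cuts the quadratic term by roughly 100x, which is why the compression ratio matters so much for long contexts. The cluster estimate of 32 million pages per day is in line with the reported figure of up to 33 million, given that each GPU produces "over" 200,000 pages.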