上下文光学压缩 - filings, earnings calls, financial reports, news

视觉标记

视觉标记

重磅，DeepSeek再开源：视觉即压缩，100个token干翻7000个

3 6 Ke· 2025-10-21 01:35

Core Insights - DeepSeek-OCR model explores the boundaries of visual-text compression, demonstrating the ability to decode over ten times the amount of text information from a limited number of visual tokens, thus addressing long-context issues in large language models (LLMs) [1][16]. Group 1: Model Performance - In the OmniDocBench benchmark, DeepSeek-OCR surpasses GOT-OCR2.0 by using only 100 visual tokens compared to GOT-OCR2.0's 256 tokens per page [2][44]. - The model shows practical value in OCR tasks, achieving a compression ratio of 10.5× to 19.7× while maintaining high decoding accuracy, with precision rates around 97% within a 10× compression ratio [37][41]. Group 2: Technical Architecture - DeepSeek-OCR employs an end-to-end visual language model (VLM) architecture consisting of an encoder (DeepEncoder) and a decoder (DeepSeek-3B-MoE) [21][34]. - The encoder, DeepEncoder, utilizes a novel architecture with approximately 380 million parameters, combining SAM-base and CLIP-large for feature extraction and tokenization [23][24]. Group 3: Compression Capabilities - The model can achieve a compression ratio of 7 to 20 times in different historical contexts, providing a feasible direction for addressing long-context issues in LLMs [16]. - DeepSeek-OCR can generate training data for LLMs/VLMs at a rate of 33 million pages daily using 20 computing nodes, each equipped with 8 A100-40G GPUs [39]. Group 4: Multilingual and Application Scope - DeepSeek-OCR is capable of processing nearly 100 languages, enhancing its applicability in global contexts [43]. - The model can also interpret charts, chemical equations, simple geometric shapes, and natural images, showcasing its versatility in various document types [43][44].

DeepSeek新模型被硅谷夸疯了！用二维视觉压缩一维文字，单GPU能跑，“谷歌核心机密被开源”

Hua Er Jie Jian Wen· 2025-10-21 00:27

Core Insights - DeepSeek has released an open-source model named DeepSeek-OCR, which is gaining significant attention in Silicon Valley for its innovative approach to processing long texts using visual compression techniques [1][4][21] - The model is designed to tackle the computational challenges associated with large models handling lengthy text, achieving high accuracy rates even with reduced token usage [1][4][5] Model Performance - DeepSeek-OCR operates with a model size of 3 billion parameters and demonstrates a remarkable ability to decode text with high accuracy, achieving 97% accuracy with a compression ratio of less than 10 times and maintaining 60% accuracy even at a 20 times compression ratio [1][4][5] - The model has been benchmarked against existing models, showing superior performance with significantly fewer visual tokens, such as using only 100 visual tokens to outperform models that require 256 tokens [7][8] Data Generation Efficiency - The model can generate over 200,000 pages of high-quality training data daily using a single A100-40G GPU, showcasing its efficiency in data generation [2][4] Innovative Approach - DeepSeek introduces a concept called "Contextual Optical Compression," which compresses textual information into visual formats, allowing the model to interpret content through images rather than text [4][10] - The architecture includes two main components: the DeepEncoder for converting images into compressed visual tokens and the DeepSeek3B-MoE-A570M for reconstructing text from these tokens [10][11] Flexibility and Adaptability - The DeepEncoder is designed to handle various input resolutions and token counts, allowing it to adapt to different compression needs and application scenarios [11][12] - The model supports complex image analyses, including financial reports and scientific diagrams, enhancing its applicability across diverse fields [12][14] Future Implications - The research suggests that this unified approach to visual and textual processing could be a step towards achieving Artificial General Intelligence (AGI) [4][21] - The team behind DeepSeek-OCR is exploring the potential of simulating human memory mechanisms through optical compression, which could lead to more efficient handling of long-term contexts in AI [20][21]

Seek .(US:SKLTY)

视觉 - 文本压缩范式

视觉 - 文本压缩范式

刚刚，DeepSeek重要突破，大模型上下文紧箍咒打破

3 6 Ke· 2025-10-20 23:22

Core Insights - DeepSeek has introduced a novel technology path in the competition of large language models by open-sourcing the DeepSeek-OCR model, which proposes the concept of "Contextual Optical Compression" for efficient information compression through text-to-image conversion [1][8]. Group 1: Model Performance and Capabilities - The feasibility of DeepSeek-OCR has been validated, achieving a decoding accuracy of 97% at a 10x compression ratio, indicating near-lossless compression, while maintaining approximately 60% accuracy at a 20x compression ratio [3][21]. - DeepSeek-OCR can express similar textual content using fewer tokens by converting text tokens into visual tokens, providing a new approach to address the high computational costs associated with processing long texts in large language models [6][11]. - In practical applications, DeepSeek-OCR surpassed GOT-OCR 2.0 using only 100 visual tokens and outperformed MinerU 2.0 with less than 800 visual tokens, demonstrating its efficiency [6][23]. Group 2: Technical Architecture - The architecture of DeepSeek-OCR consists of two main components: DeepEncoder, a visual encoder designed for high compression and high-resolution document processing, and DeepSeek3B-MoE, a lightweight mixture of experts language decoder [12][18]. - DeepEncoder employs a dual-structure design combining local and global attention to achieve high-fidelity visual understanding, significantly reducing the number of vision tokens generated from document images [14][18]. Group 3: Data and Training - DeepSeek-OCR's training process is relatively straightforward, involving independent training of DeepEncoder and the complete DeepSeek-OCR model, utilizing a large dataset for effective learning [20][21]. - The model has been trained on a diverse dataset that includes OCR 1.0 and OCR 2.0 data, general visual data, and pure text data, ensuring robust performance across various document types [25][36]. Group 4: Application and Future Directions - DeepSeek-OCR demonstrates capabilities in deep parsing, allowing it to recognize and extract structured information from various document types, including financial reports and scientific literature [24][29]. - The research team plans to further explore the integration of digital and optical text pre-training methods and evaluate the performance of optical compression in real long-text environments, indicating a promising direction for future research [39].

Seek .(US:SKLTY)

大语言模型