Optical Compression
Replicating DeepSeek-OCR in Two Weeks! A Two-Person Team Reproduces the Low-Token, High-Compression Core, and Swapping the Decoder Makes It More Practical
量子位 (QbitAI) · 2025-11-07 05:32
Core Insights
- The article covers DeepOCR, a replica of the widely praised DeepSeek-OCR built by a two-person team in just two weeks, which preserves the original's advantages of low token usage and high compression [1][5].

Group 1: Technology and Design
- DeepSeek-OCR's design philosophy centers on "visual compression": a small number of visual tokens stand in for content that would normally require many text tokens, cutting the computational cost of large models [4][6].
- The model achieves compression ratios of 7-20x and maintains 97% accuracy even at 10x compression [7].
- The architecture is a three-stage pipeline (local processing, compression, global understanding) that keeps memory usage manageable [10].

Group 2: Training and Performance
- DeepOCR is deliberately light on compute: it can be trained on just two H200 GPUs, putting it within reach of small teams [21].
- Training runs in two phases; the first trains a multi-modal projector while keeping the DeepEncoder frozen, which sharply reduces memory requirements (see the sketch below) [20].
- In practical tests, DeepOCR uses roughly 250 visual tokens, slightly less efficient than the original DeepSeek-OCR but still far better than baseline models that need thousands of tokens for similar output [22].

Group 3: Results and Future Plans
- DeepOCR performs strongly on basic tasks such as English text recognition and table parsing; on table parsing it even outperforms the original model thanks to a precise restoration of the original 2D spatial encoding [24].
- The team plans to add more data types, including formulas and multi-language support, and to explore advanced techniques for further gains [28].
- The article also profiles the team's academic backgrounds, highlighting multi-modal expertise and prior stints at well-known tech companies [29][31].
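To make the frozen-encoder phase concrete, here is a minimal PyTorch-style sketch of training only a multi-modal projector. The `Projector` class, the encoder/decoder objects, and the HuggingFace-style `inputs_embeds`/`labels` decoder interface are illustrative assumptions, not DeepOCR's actual code.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps frozen visual features into the decoder's embedding space."""
    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, txt_dim),
            nn.GELU(),
            nn.Linear(txt_dim, txt_dim),
        )

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vis_feats)

def phase1_step(encoder, projector, decoder, images, labels, optimizer):
    # Phase 1: only the projector receives gradients. The frozen encoder
    # runs under no_grad, so no activation/gradient buffers are kept for
    # it, which is where the memory savings come from.
    encoder.eval()
    with torch.no_grad():
        vis_feats = encoder(images)            # (B, n_visual_tokens, vis_dim)
    vis_embeds = projector(vis_feats)          # (B, n_visual_tokens, txt_dim)
    out = decoder(inputs_embeds=vis_embeds, labels=labels)  # HF-style stand-in
    optimizer.zero_grad()
    out.loss.backward()
    optimizer.step()
    return out.loss.item()

# Only projector parameters are handed to the optimizer:
#   optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
```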
AI Evolves Again: DeepSeek Rolls Out Another "Blockbuster" New Feature
36Kr · 2025-10-24 11:48
Core Insights
- DeepSeek has introduced a new open-source model called DeepSeek-OCR, which uses a roughly 3-billion-parameter architecture to read text through images, effectively compressing text into visual tokens [1][2][19].

Group 1: Model Functionality
- The model replaces traditional text tokens with visual tokens, an "optical compression" that sharply reduces the amount of data processed [2][5].
- For instance, content that originally required 1,000 tokens can be represented with just 100 visual tokens, a 10x compression ratio at 97% OCR accuracy (see the worked example below) [5][19].
- The model consists of two main components: DeepEncoder for image compression and DeepSeek3B-MoE for decoding the visual tokens back into text [11][12].

Group 2: Training and Data Utilization
- DeepSeek trained the model on an extensive dataset of 30 million PDF documents spanning about 100 languages, with Chinese and English making up the largest share [12][14].
- Training also included 3 million Word documents for specialized tasks such as formula recognition and HTML table extraction, reflecting broad data coverage [14][19].

Group 3: Performance and Efficiency
- In tests, DeepSeek-OCR outperformed existing models such as GOT-OCR2.0 and MinerU2.0 while using fewer visual tokens [16][19].
- The architecture activates only a fraction of its parameters during processing, improving speed and reducing computational load [11][19].

Group 4: Philosophical Implications
- The model introduces a form of selective memory: older information is progressively compressed, simulating human-like forgetting and enabling more efficient long-term interaction [16][18].
- This challenges the notion that AI memory requires accumulation, suggesting effective retention hinges on relevance and clarity instead [18][22].
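As a quick sanity check on the numbers above, a few lines of Python reproduce the token arithmetic. The helper function and the quadratic attention-cost model are illustrative, not DeepSeek's code; the quadratic model is simply the standard Transformer scaling that motivates compression in the first place.

```python
def optical_compression_savings(text_tokens: int, visual_tokens: int,
                                attn_cost=lambda n: n * n) -> dict:
    """Token savings from representing text as visual tokens.

    Numbers below are the article's 1000 -> 100 example, the operating
    point reported to keep ~97% OCR accuracy.
    """
    return {
        "compression_ratio": text_tokens / visual_tokens,          # 10.0x
        "token_savings": 1 - visual_tokens / text_tokens,          # 90%
        "attention_cost_factor":
            attn_cost(text_tokens) / attn_cost(visual_tokens),     # 100x
    }

print(optical_compression_savings(1000, 100))
# {'compression_ratio': 10.0, 'token_savings': 0.9, 'attention_cost_factor': 100.0}
```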
In-Depth Analysis of DeepSeek-OCR: The Optical-Compression Path for Long-Text Processing and an Outlook on Industry Applications
Investment Rating
- The report does not explicitly provide an investment rating for the industry or for the specific companies involved in DeepSeek-OCR technology.

Core Insights
- DeepSeek-OCR offers a new approach to long-text processing: text is mapped into high-resolution 2D images and compressed into visual tokens, achieving approximately 97% decoding accuracy at a 10x compression ratio and about 60% accuracy at 20x [1][9].
- The technology is particularly well suited to structured information such as tables and charts, and can significantly reduce compute and memory consumption in long-document scenarios [1][9].
- DeepSeek-OCR marks a shift from the traditional route of expanding context windows toward a more efficient "compress-then-decompress" model with lower computational load [2][10].

Summary by Sections

Technology Overview
- DeepSeek-OCR reconstructs text from compressed visual tokens using a roughly 3-billion-parameter MoE decoder that activates about 570 million parameters per token, demonstrating high accuracy even under extreme compression [1][9].
- The technology aligns with the "pixel-unified input" paradigm, easing the processing of heterogeneous information types [1][9].

Comparative Analysis
- DeepSeek-OCR and models like ChatGPT/Gemini embody different technical routes: DeepSeek pursues high-density storage through compression, while ChatGPT/Gemini expand context windows for immediate access [4][12].
- The two approaches complement each other: DeepSeek-OCR is the more efficient choice for low-cost long-context memory storage, while large-window models are better suited to detailed reasoning tasks [4][12].

Application Strategy
- The report suggests lower compression rates for critical content, to preserve detail, and higher rates for less critical background information, improving overall efficiency; a tiered policy along these lines is sketched below [3][11].
- DeepSeek-OCR is expected to see early large-scale adoption in document-heavy fields such as financial reporting and scientific literature [3][11].

Industry Context
- The report highlights the evolution of AI in China, noting that DeepSeek's innovations are gaining international recognition even as U.S. companies retain advantages in systemic capabilities [6][14].
- The competitive focus is shifting from raw computational power to architectural insight and product engineering, pointing toward differentiated development paths for the industry [6][14].
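A minimal sketch of the tiered-compression policy the report recommends. The criticality labels, the specific ratios, and the token estimate are assumptions for illustration; they are not DeepSeek-OCR's API, only a way to allocate the near-lossless (<10x) and lossy (~20x) operating points the report cites.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    criticality: str  # "critical" | "normal" | "background" (hypothetical labels)

# Lower ratios preserve detail; higher ratios save tokens.
COMPRESSION_POLICY = {
    "critical": 5,     # comfortably inside the ~97%-accuracy region (<10x)
    "normal": 10,      # the reported 10x / ~97% operating point
    "background": 20,  # lossy (~60% accuracy) but cheap
}

def visual_token_budget(chunks: list[Chunk], tokens_per_char: float = 0.25) -> int:
    """Estimate the total visual-token cost after per-chunk compression."""
    total = 0
    for c in chunks:
        text_tokens = int(len(c.text) * tokens_per_char)
        total += max(1, text_tokens // COMPRESSION_POLICY[c.criticality])
    return total
```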
DeepSeek's Ultimate Ambition: Remaking the Basic Language of Large Language Models into Images
36Kr · 2025-10-21 12:52
Core Insights
- DeepSeek has open-sourced DeepSeek-OCR, an OCR model that achieves state-of-the-art results on benchmarks like OmniDocBench [1].
- The motivation behind entering the OCR field is to address the computational bottleneck of long-context processing in large language models (LLMs) [4][6].
- The paper proposes that text information can be efficiently compressed through optical 2D mapping, allowing vision-language models (VLMs) to decompress the original information from images [4][6].

Group 1: Long Context Processing
- The pursuit of longer context in LLMs has become a competitive arms race, with token windows expanding from thousands to millions [7].
- The core limitation lies in the Transformer attention mechanism, whose computational complexity and memory usage grow quadratically with sequence length [7].
- DeepSeek-AI's engineers pose a more fundamental question: rather than only optimizing attention calculations, can the number of tokens itself be compressed? [7][10]

Group 2: Visual Tokens vs. Text Tokens
- Visual tokens are the basic units of information processed by vision models, while text tokens are the units used by LLMs [8].
- A 1024x1024 image can be divided into 4,096 visual tokens, which after compression represent the same content with far fewer tokens than a text encoding would need [9].
- The realization that the visual modality can serve as an efficient compression medium for text led to the creation of DeepSeek-OCR [9].

Group 3: DeepEncoder and Compression Techniques
- DeepSeek-OCR is essentially a proof of concept for an "optical compression-decompression" system [10].
- Its key innovation, the DeepEncoder, is designed to handle high-resolution inputs while producing minimal visual tokens [11][12].
- The encoder is a three-stage pipeline: a local detail processor, a compression module, and a global attention layer (see the sketch after this summary) [14][16].

Group 4: Performance Metrics
- Experimental results show a 10.5x compression rate, with 64 visual tokens decoding 600-700 text tokens at an OCR accuracy of 96.5% [17][18].
- At a 20x compression rate, the model still decodes over 1,200 text tokens while maintaining around 60% accuracy [17][18].
- DeepSeek-OCR outperforms existing models like GOT-OCR2.0 and MinerU2.0 in both performance and token efficiency [19][20].

Group 5: Future Vision and Memory Simulation
- The team aims to simulate human memory's forgetting mechanism, which naturally prioritizes relevant information while compressing less important details [25][27].
- The multi-resolution design of DeepSeek-OCR provides a technical foundation for managing memory in a way that mimics human cognition [29][30].
- The ultimate goal is a system that balances information retention and computational efficiency, potentially leading to a new paradigm for AI memory and input systems [32][35].
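For readers who want the three-stage shape in code, here is a structural PyTorch sketch: stage 1 stands in for cheap local/windowed attention over many high-resolution patch tokens, stage 2 is a convolutional compressor that cuts the token count 16x, and stage 3 runs expensive global attention only over the few surviving tokens. All module choices and dimensions are illustrative stand-ins, not the real DeepEncoder (which builds on SAM- and CLIP-style components).

```python
import torch
import torch.nn as nn

class ThreeStageEncoder(nn.Module):
    def __init__(self, dim: int = 768, patch: int = 16, downsample: int = 4):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Stage 1: stand-in for local/windowed attention over patch tokens.
        self.local = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        # Stage 2: convolutional compressor, 4x4 stride 4 -> 16x fewer tokens.
        self.compress = nn.Conv2d(dim, dim, kernel_size=downsample, stride=downsample)
        # Stage 3: global attention over the small set of compressed tokens.
        self.globl = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.patchify(img)                          # (B, C, H/16, W/16)
        b, c, h, w = x.shape
        t = self.local(x.flatten(2).transpose(1, 2))    # 4096 tokens for 1024^2 input
        x = t.transpose(1, 2).reshape(b, c, h, w)
        x = self.compress(x)                            # 256 tokens remain
        return self.globl(x.flatten(2).transpose(1, 2))

tokens = ThreeStageEncoder()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 256, 768])
```

The ordering is the point of the design: the quadratic-cost global attention never sees the full 4,096-token sequence, only the 256 tokens left after compression.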
DeepSeek Open-Sources New Model: A Single A100 Can Process Over 200,000 Pages per Day
第一财经 (Yicai) · 2025-10-20 14:58
Core Viewpoint
- DeepSeek has released a new OCR model that uses the visual modality for efficient text compression, achieving significant reductions in token usage while maintaining high accuracy in text recognition [2][5][6].

Summary by Sections

Model Overview
- The new OCR model, DeepSeek-OCR, was open-sourced on October 20 and is detailed in the paper "DeepSeek-OCR: Contexts Optical Compression" [2].
- The model addresses the computational challenge large language models face on lengthy text by compressing it into the visual domain, achieving nearly 10x near-lossless context compression while keeping OCR accuracy above 97% [5][6].

Technical Specifications
- The model can generate training data for large language models / vision-language models at over 200,000 pages per day on a single A100-40G GPU [7].
- DeepSeek-OCR consists of two main components: DeepEncoder, for image feature extraction and compression, and DeepSeek3B-MoE, for reconstructing text from the compressed visual tokens [7].
- The decoder employs a Mixture-of-Experts (MoE) design, activating 6 of 64 experts for approximately 570 million active parameters, combining the expressive power of a 3-billion-parameter model with the inference efficiency of a 500-million-parameter one [7].

Experimental Results
- When the number of text tokens is within 10x the number of visual tokens (compression ratio below 10), OCR accuracy reaches 97%; even at a compression ratio of 20, accuracy remains around 60% [7].

Future Directions
- The team proposes simulating human memory decay through optical compression, gradually shrinking the rendered images of older context to reduce token consumption, which could enable breakthroughs in handling ultra-long contexts (a sketch of the idea follows below) [8].

Community Response
- The release drew positive feedback, passing 1,400 GitHub stars shortly after launch, a sign of strong interest in the model [9].
- The project was led by researchers with prior experience developing advanced OCR systems, a solid foundation for the new model [9].

Market Position
- Some market voices remain concerned about DeepSeek's pace of innovation, suggesting the company may be focusing on internal development in preparation for future models [10].
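The memory-decay proposal can be sketched in a few lines: render older context at progressively lower resolution so that it costs fewer visual tokens, while recent context stays sharp. The rendering and the tokens-per-patch accounting below are illustrative assumptions (and require Pillow), not DeepSeek's implementation.

```python
from PIL import Image, ImageDraw

def render_context(text: str, side: int) -> Image.Image:
    """Render a context chunk as a square image of the given side length."""
    img = Image.new("RGB", (side, side), "white")
    ImageDraw.Draw(img).text((8, 8), text, fill="black")
    return img

def decayed_memory(chunks: list[str], base: int = 1024, patch: int = 32):
    """Older chunks get smaller renders, hence fewer visual tokens."""
    pages = []
    for age, text in enumerate(reversed(chunks)):  # age 0 = most recent
        side = max(patch, base // (2 ** age))      # halve resolution per step
        img = render_context(text, side)
        n_tokens = (side // patch) ** 2            # tokens scale with image area
        pages.append((img, n_tokens))
    return pages

pages = decayed_memory(["oldest turn", "older turn", "latest turn"])
print([n for _, n in pages])  # [1024, 256, 64]: recent stays sharp, old fades
```

Because token count scales with image area, each halving of resolution cuts the cost of an old chunk by 4x, which is the "forgetting curve" behavior the team describes.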