Contextual Optical Compression
DeepSeek-OCR Achieves Optical Compression: Optical Computing Can Lighten the Load for Large Models
36Kr · 2025-11-27 08:49
Group 1
- The core idea of the article is optical compression of context: using visual tokens to address the computational challenges faced by large language models as context windows grow [2][3]
- DeepSeek's research demonstrates that visual compression can maintain high accuracy, achieving a 10x compression rate while retaining 96.5% precision [3][4]
- The DeepEncoder module is identified as the key engine for optical compression, combining the SAM module, convolutional blocks, and CLIP to compress 1000 text tokens down to 100 visual tokens [5][7]

Group 2
- Optical computing is highlighted as a more suitable substrate for context compression because it handles the information-aggregation operations inherent in ViT and CNN structures more efficiently than traditional electronic chips [7][9]
- Its advantages include simplified computation and scalability, enabling greater parallelism and dynamic programmability, which are crucial for long-text reasoning tasks [9][11]
- Future plans involve exploring algorithms based on human memory mechanisms and developing specialized hardware for context compression and AI tasks, aiming to connect optical computing with large models [13][15]

Group 3
- The article emphasizes that optical computing must overcome the limitations of traditional GPUs, particularly memory constraints and power density, as large models become more prevalent [15]
- The company aims to build a next-generation disruptive platform for large-scale AI computing, providing comprehensive optical computing solutions across scenarios [15]
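The token figures above can be checked with simple patch arithmetic. A minimal Python sketch, assuming a 16-pixel ViT patch size and a 16x convolutional token compressor between the SAM and CLIP stages (both figures are assumptions for illustration, not stated in this summary):

```python
def encoder_token_counts(image_px: int, patch_px: int = 16, conv_reduction: int = 16) -> dict:
    """Rough token bookkeeping for a SAM -> conv -> CLIP style encoder.

    image_px: side length of the (square) input image in pixels.
    patch_px: assumed ViT patch size for the SAM (windowed attention) stage.
    conv_reduction: assumed token reduction factor of the conv compressor.
    """
    side = image_px // patch_px
    sam_tokens = side * side                     # patch tokens entering SAM
    clip_tokens = sam_tokens // conv_reduction   # compressed tokens fed to CLIP
    return {"sam_tokens": sam_tokens, "clip_tokens": clip_tokens}

# A 1024x1024 page: 4096 patch tokens compressed to 256 visual tokens.
print(encoder_token_counts(1024))  # {'sam_tokens': 4096, 'clip_tokens': 256}
```

Under these assumptions a 640x640 input yields exactly the 100-visual-token budget cited in the summary, which is a useful consistency check on the reported numbers.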
DeepSeek Quietly Launches a New Model
21st Century Business Herald · 2025-10-30 10:42
Core Insights
- DeepSeek has released a new multimodal model, DeepSeek-OCR, which has sparked significant industry discussion about its potential applications in optical and quantum computing [1]
- The model's visual encoder enables efficient decoding, providing a clear technical pathway for integrating optical computing into large language models (LLMs) [1]

Group 1: Contextual Optical Compression
- DeepSeek has introduced "Contextual Optical Compression" technology, which processes text as images to achieve efficient information compression, theoretically allowing for infinite context [3]
- The technology can compress tokens by 7 to 20 times; for instance, a page of text that typically requires 2000-5000 tokens can be reduced to just 200-400 visual tokens [3][4]
- The model maintains 97% decoding accuracy at 10x compression and still achieves about 60% accuracy at 20x compression, which is crucial for implementing a forgetting mechanism in LLM memory [4]

Group 2: Optical Computing Integration
- By transforming text problems into image problems, DeepSeek's OCR technology may pave the way for integrating optical computing chips into large language models [5]
- Optical computing chips are seen as a candidate technology for the "post-Moore era," leveraging light-speed transmission, high parallelism, and low power consumption for AI and other computation-intensive tasks [5]
- The DeepEncoder component of DeepSeek-OCR is particularly suited to execution on optical co-processors, while text decoding would still be handled by electronic chips [5]

Group 3: Challenges and Industry Landscape
- Current challenges for optical computing include advanced optoelectronic integration and the maturity of the software ecosystem, which hinder large-scale deployment and optimization [6]
- Key players in the domestic market include companies like Xizhi Technology and Turing Quantum, while international competitors include Lightmatter and Cerebras Systems [6][7]
- Turing Quantum has made significant progress in the mass production of thin-film lithium niobate (TFLN) products, but it may take 3 to 5 years to compete with GPUs in data centers due to engineering, cost, and ecosystem challenges [7]
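The compression-versus-accuracy tradeoff described above (about 97% decoding accuracy up to 10x compression, falling to about 60% at 20x) can be sketched as a token-budget calculator. The straight-line interpolation between the two reported points is an assumption for illustration, not a measured curve:

```python
def visual_token_budget(text_tokens: int, compression_ratio: float) -> int:
    """Visual tokens needed to represent `text_tokens` at a given compression ratio."""
    return max(1, round(text_tokens / compression_ratio))

def rough_accuracy(compression_ratio: float) -> float:
    """Piecewise-linear sketch of the reported accuracy curve: ~97% at up to 10x
    compression, ~60% at 20x (figures from the articles; the linear interpolation
    in between is an invented assumption)."""
    if compression_ratio <= 10:
        return 0.97
    if compression_ratio >= 20:
        return 0.60
    return 0.97 - (compression_ratio - 10) * (0.97 - 0.60) / 10

# A 3000-token page at 10x compression needs ~300 visual tokens at ~97% accuracy.
print(visual_token_budget(3000, 10), rough_accuracy(10))  # 300 0.97
```

This matches the article's example range: pages of 2000-5000 text tokens map to 200-400 visual tokens in the near-lossless (up to 10x) regime.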
DeepSeek-OCR: Large-Model Technology Stands at a New Crossroads
36Kr · 2025-10-22 23:15
Core Insights
- DeepSeek has introduced "DeepSeek-OCR," a model built on "Context Optical Compression" that significantly improves the efficiency of processing textual information from images [1][2][7]
- The model demonstrates that images can serve as efficient carriers of information, challenging the traditional reliance on text-based processing [2][6]

Group 1: Image Processing Efficiency
- DeepSeek-OCR treats text as images, compressing entire pages into a handful of visual tokens and achieving a tenfold efficiency increase at a 97% accuracy rate [1][2]
- Traditional methods require thousands of tokens for a lengthy article, while DeepSeek-OCR needs only about 100 visual tokens, allowing it to handle long documents without resource constraints [2][3]

Group 2: System Architecture and Functionality
- The system consists of two modules: a powerful DeepEncoder that captures page information and a lightweight text generator that converts visual tokens into readable output [3]
- The encoder combines local analysis with global understanding, reducing the initial 4096 tokens to just 256, a roughly 90% reduction compared to competitors [3][4]
- In practical tests, a single A100 GPU can process over 200,000 pages daily, with potential scalability to 33 million pages across multiple servers [3][4]

Group 3: Information Density and Model Training
- The apparent paradox of image data being more efficient comes down to information density: images can encapsulate more data compactly, whereas text tokens require extensive dimensional expansion [4][5]
- While DeepSeek-OCR proves the feasibility of visual tokens, training purely visual models remains a challenge due to the ambiguity of predicting image segments [5][9]

Group 4: Potential Impact and Applications
- If widely adopted, this technology could transform the "token economy," significantly reducing processing costs for long documents and enhancing data extraction from complex formats [6][7]
- It could also improve chatbots' long-term memory by converting old conversations into low-resolution images, simulating human memory decay while extending context without increasing token consumption [6][11]

Group 5: Conclusion
- The exploration of DeepSeek-OCR not only achieves a tenfold efficiency improvement but also redefines the boundaries of document processing, challenging existing limitations and optimizing cost structures [7][8]
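The throughput figures quoted across these summaries scale consistently: 33 million pages per day over the fleet reported elsewhere in this digest (20 nodes with 8 A100s each) implies roughly 206k pages per GPU per day, in line with the 200,000-pages-per-day single-GPU figure. A minimal sketch, assuming ideal linear scaling (real pipelines lose some throughput to I/O and scheduling):

```python
def fleet_throughput(pages_per_gpu_per_day: int, nodes: int, gpus_per_node: int) -> int:
    """Daily page throughput for a GPU fleet under an idealized linear-scaling assumption."""
    return pages_per_gpu_per_day * nodes * gpus_per_node

# 33M pages/day over 20 nodes x 8 GPUs implies ~206k pages per GPU per day.
per_gpu = 33_000_000 // (20 * 8)
print(per_gpu)                           # 206250

# Conversely, 200k pages/GPU/day across the same fleet gives 32M pages/day.
print(fleet_throughput(200_000, 20, 8))  # 32000000
```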
Breaking: DeepSeek Open-Sources Again, with Vision as Compression and 100 Tokens Beating 7000
36Kr · 2025-10-21 01:35
Core Insights
- The DeepSeek-OCR model explores the boundaries of visual-text compression, decoding more than ten times as much text information from a limited number of visual tokens and thereby addressing long-context issues in large language models (LLMs) [1][16]

Group 1: Model Performance
- In the OmniDocBench benchmark, DeepSeek-OCR surpasses GOT-OCR2.0 while using only 100 visual tokens per page versus GOT-OCR2.0's 256 [2][44]
- The model shows practical value in OCR tasks, achieving compression ratios of 10.5x to 19.7x while maintaining high decoding accuracy, with precision around 97% within a 10x compression ratio [37][41]

Group 2: Technical Architecture
- DeepSeek-OCR employs an end-to-end vision-language model (VLM) architecture consisting of an encoder (DeepEncoder) and a decoder (DeepSeek-3B-MoE) [21][34]
- The encoder, DeepEncoder, uses a novel architecture with approximately 380 million parameters, combining SAM-base and CLIP-large for feature extraction and tokenization [23][24]

Group 3: Compression Capabilities
- The model can achieve compression ratios of 7 to 20 times for historical context of varying depth, offering a feasible direction for addressing long-context issues in LLMs [16]
- DeepSeek-OCR can generate training data for LLMs/VLMs at a rate of 33 million pages daily using 20 computing nodes, each equipped with 8 A100-40G GPUs [39]

Group 4: Multilingual and Application Scope
- DeepSeek-OCR can process nearly 100 languages, enhancing its applicability in global contexts [43]
- The model can also interpret charts, chemical equations, simple geometric shapes, and natural images, showcasing its versatility across document types [43][44]
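The large token reduction inside the encoder (elsewhere in this digest reported as 4096 tokens down to 256, a 16x cut) can be pictured as merging each 4x4 neighborhood of patch tokens into one. The summaries do not give the convolutional details, so this NumPy reshape is only a shape-level illustration of a 16x token merge, not the model's actual operator:

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, factor: int = 4) -> np.ndarray:
    """Merge each (factor x factor) spatial block of tokens into one by
    concatenating channels: (H, W, C) -> (H//factor, W//factor, C*factor*factor).
    A 4x4 merge cuts the token count 16x."""
    h, w, c = tokens.shape
    t = tokens.reshape(h // factor, factor, w // factor, factor, c)
    t = t.transpose(0, 2, 1, 3, 4).reshape(h // factor, w // factor, factor * factor * c)
    return t

grid = np.zeros((64, 64, 768))            # 4096 patch tokens from a 1024x1024 image
merged = merge_tokens(grid)
print(merged.shape[0] * merged.shape[1])  # 256 tokens
```

The design point this illustrates: the expensive windowed-attention stage (SAM) runs on many cheap tokens, while the expensive global-attention stage (CLIP) only ever sees the merged 16x-smaller set.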
DeepSeek's New Model Wows Silicon Valley: Compressing 1D Text with 2D Vision on a Single GPU, with Commenters Saying "Google's Core Secret Has Been Open-Sourced"
Hua Er Jie Jian Wen · 2025-10-21 00:27
Core Insights
- DeepSeek has released an open-source model named DeepSeek-OCR, which is gaining significant attention in Silicon Valley for its innovative approach to processing long texts with visual compression [1][4][21]
- The model is designed to tackle the computational challenges of handling lengthy text in large models, achieving high accuracy even with sharply reduced token usage [1][4][5]

Model Performance
- DeepSeek-OCR has 3 billion parameters and decodes text with high accuracy: 97% at compression ratios below 10x, and still about 60% at 20x compression [1][4][5]
- Benchmarked against existing models, it delivers superior performance with far fewer visual tokens, for example outperforming models that require 256 tokens while using only 100 [7][8]

Data Generation Efficiency
- The model can generate over 200,000 pages of high-quality training data daily on a single A100-40G GPU [2][4]

Innovative Approach
- DeepSeek introduces "Contextual Optical Compression," which compresses textual information into visual form, letting the model interpret content through images rather than text [4][10]
- The architecture includes two main components: DeepEncoder, which converts images into compressed visual tokens, and DeepSeek3B-MoE-A570M, which reconstructs text from those tokens [10][11]

Flexibility and Adaptability
- DeepEncoder handles a range of input resolutions and token counts, adapting to different compression needs and application scenarios [11][12]
- The model supports complex image analysis, including financial reports and scientific diagrams, broadening its applicability across fields [12][14]

Future Implications
- The research suggests this unified approach to visual and textual processing could be a step toward Artificial General Intelligence (AGI) [4][21]
- The team behind DeepSeek-OCR is exploring simulation of human memory mechanisms through optical compression, which could enable more efficient handling of long-term context in AI [20][21]
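The "range of input resolutions and token counts" mentioned under Flexibility suggests a simple mode table. A hypothetical mode-selection helper follows; the mode names and token budgets are assumptions based on commonly reported configurations (this digest itself only cites the 100- and 256-token figures), and the 10x threshold is the regime where the articles report ~97% accuracy:

```python
# Hypothetical resolution modes and their visual-token budgets (assumed figures).
MODES = {"tiny": 64, "small": 100, "base": 256, "large": 400}

def pick_mode(estimated_text_tokens: int, max_compression: float = 10.0) -> str:
    """Choose the cheapest mode that keeps the compression ratio at or below
    `max_compression`, i.e. whose token budget covers the estimated text load."""
    needed = estimated_text_tokens / max_compression
    for name, budget in sorted(MODES.items(), key=lambda kv: kv[1]):
        if budget >= needed:
            return name
    return "large"  # fall back to the largest budget; compression exceeds the target

print(pick_mode(900))   # 'small': needs >= 90 visual tokens
print(pick_mode(2500))  # 'base':  needs >= 250 visual tokens
```

A page too dense for every budget falls back to the largest mode and simply runs at a higher (lossier) compression ratio, which mirrors the accuracy tradeoff the articles describe.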
Just In: A Major DeepSeek Breakthrough Breaks the "Tightening Spell" on Large-Model Context
36Kr · 2025-10-20 23:22
Core Insights
- DeepSeek has opened a novel technical path in the large-language-model race by open-sourcing DeepSeek-OCR, which proposes "Contextual Optical Compression" for efficient information compression via text-to-image conversion [1][8]

Group 1: Model Performance and Capabilities
- The feasibility of DeepSeek-OCR has been validated: it achieves 97% decoding accuracy at a 10x compression ratio, indicating near-lossless compression, while maintaining approximately 60% accuracy at 20x [3][21]
- By converting text tokens into visual tokens, DeepSeek-OCR can express similar textual content with fewer tokens, offering a new approach to the high computational cost of processing long texts in large language models [6][11]
- In practical comparisons, DeepSeek-OCR surpassed GOT-OCR 2.0 using only 100 visual tokens and outperformed MinerU 2.0 with fewer than 800 visual tokens [6][23]

Group 2: Technical Architecture
- The architecture consists of two main components: DeepEncoder, a visual encoder designed for high compression and high-resolution document processing, and DeepSeek3B-MoE, a lightweight mixture-of-experts language decoder [12][18]
- DeepEncoder employs a dual design combining local and global attention to achieve high-fidelity visual understanding while sharply reducing the number of vision tokens generated from document images [14][18]

Group 3: Data and Training
- Training is relatively straightforward, involving independent training of DeepEncoder followed by the complete DeepSeek-OCR model, using a large dataset for effective learning [20][21]
- The model was trained on a diverse dataset including OCR 1.0 and OCR 2.0 data, general visual data, and pure text data, ensuring robust performance across document types [25][36]

Group 4: Application and Future Directions
- DeepSeek-OCR demonstrates deep-parsing capabilities, recognizing and extracting structured information from various document types, including financial reports and scientific literature [24][29]
- The research team plans to further explore the integration of digital and optical text pre-training and to evaluate optical compression in real long-text environments, pointing to a promising direction for future research [39]
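Several of the articles above describe simulating human memory decay by re-rendering older context at progressively lower resolution, so older pages cost fewer visual tokens. A toy token-budget model of that idea; the halving schedule and the minimum floor are invented for illustration and are not the team's actual method:

```python
def context_cost_with_decay(pages: list[int], decay: float = 0.5, floor: int = 8) -> int:
    """Total visual-token cost of a conversation history where each step back in
    time multiplies a page's token budget by `decay`, down to a minimum `floor`.
    pages[0] is the most recent page; pages[-1] is the oldest."""
    total = 0
    for age, tokens in enumerate(pages):
        total += max(floor, int(tokens * decay ** age))
    return total

history = [256, 256, 256, 256]           # four pages at a full 256-token budget
print(context_cost_with_decay(history))  # 256 + 128 + 64 + 32 = 480
print(sum(history))                      # 1024 without decay
```

The point of the sketch: under such a schedule the cost of unbounded history converges toward a constant per-page floor, which is one way to read the articles' claim that context can be extended without a proportional token cost.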