DeepSeek OCR: The Real Aim Lies Elsewhere
Founder Park· 2025-10-21 07:46
Core Viewpoint
- DeepSeek-OCR is a new AI model that processes text in images by treating text as visual data, achieving 10x compression while maintaining a recognition accuracy of 96.5% [7][11].

Group 1: Model Performance and Innovation
- DeepSeek-OCR can compress a 1,000-word article into just 100 visual tokens, showcasing its efficiency [7].
- The model offers multiple resolution options, requiring as few as 64 tokens for a 512 x 512 image and 256 tokens for a 1024 x 1024 image [13].
- Using visual tokens for text recognition is not entirely novel, but the model represents a significant step in productization and application [13][14].

Group 2: Industry Reactions and Future Directions
- Notable figures in the AI community, such as Karpathy, have expressed interest in the model, suggesting that future large language models (LLMs) might benefit from image-based inputs instead of traditional text [11][15].
- DeepSeek-OCR's potential to improve the processing of mixed media (text, images, tables) in various applications is highlighted, as current visual models struggle with such tasks [15].
- The idea of simulating a forgetting mechanism through resolution adjustments is intriguing but raises questions about its applicability in digital systems compared to human cognition [15].
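The resolution options above imply a visual-token count that grows with the square of the image side. A minimal sketch of that relationship, assuming an effective patch size of 64 pixels per token — an assumption inferred only from the 64-token/512px and 256-token/1024px figures quoted here, not from the model's published internals:

```python
def visual_token_budget(side_px: int, effective_patch_px: int = 64) -> int:
    """Estimate visual tokens for a square image of side `side_px`.

    Assumes one token per effective_patch_px x effective_patch_px region --
    a hypothetical simplification chosen so the numbers match the
    article's figures, not DeepSeek's actual encoder arithmetic.
    """
    grid = side_px // effective_patch_px
    return grid * grid

print(visual_token_budget(512))   # 64 tokens for a 512 x 512 image
print(visual_token_budget(1024))  # 256 tokens for a 1024 x 1024 image
```

Under this assumption, doubling the image side quadruples the token budget, which is why the resolution options matter for cost.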
Text Is Dead, Vision Shall Rise: Karpathy Raves About DeepSeek's New Model, Ending the Tokenizer Era
36Kr· 2025-10-21 07:22
Core Insights
- DeepSeek has made a significant breakthrough with its new model, DeepSeek-OCR, which fundamentally shifts the input paradigm from text to visual data, suggesting that visual inputs may become mainstream in AI applications [1][14][17].

Performance Metrics
- DeepSeek-OCR achieves approximately 2,500 tokens per second on a single A100-40G card while maintaining 97% OCR accuracy. It can compress visual context to 1/20 of its original size, with typical usage achieving a compression ratio below 1/10 [3][5].
- The model can compress an entire page of dense text into just 100 visual tokens, achieving up to 60x compression on the OmniDocBench benchmark [5][11].

Technical Advantages
- DeepSeek-OCR combines fewer parameters, high compression rates, fast processing, and support for 100 languages, making it both theoretically valuable and highly practical [7][11].
- The model suggests that physical pages (such as microfilm and books) are superior data sources for training AI models compared to low-quality internet text [11].

Industry Implications
- The shift from text to visual inputs could redefine how large language models process information, potentially eliminating the need for traditional tokenizers, which have long been criticized for their inefficiencies [16][19].
- Karpathy, a prominent figure in AI, argues that in the future all inputs to AI models may be images, improving efficiency and information flow [15][25].

Community Response
- The open-source project gained significant traction, receiving 4.4k stars on GitHub overnight, indicating strong community interest and support [10][46].
Karpathy Praises DeepSeek-OCR for "Retiring" the Tokenizer! Hands-On: Using Claude Code to Run the New Model on NVIDIA GPUs
AI前线· 2025-10-21 04:54
Core Insights
- DeepSeek has released a new model, DeepSeek-OCR, a 6.6 GB model fine-tuned specifically for OCR, achieving near-lossless 10x compression and 20x compression while retaining 60% accuracy [2].
- The model introduces DeepEncoder to balance the trade-offs among high resolution, low memory, and few tokens, achieving state-of-the-art performance in practical scenarios with minimal token consumption [2][4].
- The model's architecture is lightweight, consisting of only 12 layers, which suits the pattern-recognition nature of OCR tasks [5].

Model Innovations
- DeepSeek-OCR renders original content as images before input, leading to more efficient information compression and richer information flow [6].
- The model eliminates the need for tokenizers, which have been criticized for their inefficiencies and historical baggage, enabling a more seamless end-to-end process [6].
- It employs a Mixture of Experts design, activating only about 570 million parameters during inference (per the A570M designation), allowing efficient processing of large workloads [7].

Market Position and Future Implications
- Alexander Doria, co-founder of Pleiasfr, views DeepSeek-OCR as a milestone, suggesting it lays a foundation for future OCR systems [4][8].
- The training pipeline includes a significant amount of synthetic and simulated data; while the model balances inference efficiency and performance, further domain-specific customization is needed for large-scale real-world applications [8].

Developer Engagement
- The release has attracted many developers; Simon Willison successfully ran the model on an NVIDIA Spark in about 40 minutes, showcasing its accessibility and ease of use [9][21].
- Willison emphasized the importance of providing a clear environment and task definition for successful implementation, highlighting the model's practical utility [24].
DeepSeek's New Model Is Wild: The Entire AI Community Is Studying the Visual Route, and Karpathy Isn't Holding Back
36Kr· 2025-10-21 04:12
Core Insights
- The introduction of DeepSeek-OCR could revolutionize the paradigm of large language models (LLMs) by suggesting that all inputs be treated as images rather than text, which could yield significant improvements in efficiency and context handling [1][3][8].

Group 1: Model Performance and Efficiency
- DeepSeek-OCR can compress a 1,000-word article into 100 visual tokens, a compression efficiency ten times better than traditional text tokenization, while maintaining 97% accuracy [1][8].
- A single NVIDIA A100 GPU can process 200,000 pages of data daily with the model, indicating high throughput [1].
- Using visual tokens instead of text tokens could allow a more efficient representation of information, potentially expanding the effective context size of LLMs significantly [9][10].

Group 2: Community Reception and Validation
- The open-source release of DeepSeek-OCR garnered over 4,000 stars on GitHub within a single night, reflecting strong interest and validation from the AI community [1].
- Notable figures in the AI field, such as Andrej Karpathy, have praised the model, indicating its potential impact and effectiveness [1][3].

Group 3: Theoretical Implications
- Representing text as visual tokens raises questions about how this might affect the cognitive capabilities of LLMs, particularly reasoning and language expression [9][10].
- The concept aligns with human cognition, where visual memory plays a significant role in recall, suggesting a more natural way for models to process and retrieve data [9].

Group 4: Historical Context and Comparisons
- While DeepSeek-OCR presents a novel approach, similar ideas were explored in the 2022 paper "Language Modelling with Pixels," which proposed a pixel-based language encoder [14][16].
- Ongoing work in this area includes research papers that build on the foundational ideas of visual tokenization and its applications in multi-modal learning [16].

Group 5: Criticism and Challenges
- Some researchers have criticized DeepSeek-OCR for lacking progressive development compared to human cognitive processes, suggesting the model may not fully replicate human-like understanding [19].
Overtaken by Doubao in MAU, DeepSeek Squeezes Out an Incremental Update
36Kr· 2025-10-21 04:12
Core Insights
- DeepSeek has launched DeepSeek-OCR, an open-source model with approximately 3 billion parameters that improves scanning efficiency through a "visual-text compression" approach [1][2][3].
- DeepSeek has recently been surpassed by its competitor Doubao in monthly active users (MAU): Doubao reached approximately 157 million MAU, a 6.6% increase, versus DeepSeek's 143 million [1][9].
- The competition between DeepSeek and Doubao highlights a shift in the consumer (C-end) AI market, with Doubao leveraging its multi-modal capabilities and integration with the Douyin ecosystem [1][2][9].

DeepSeek-OCR Model
- DeepSeek-OCR's "visual-text compression" method achieves superior performance with fewer visual tokens than traditional OCR systems [3][4].
- The model decodes with 97% accuracy at 10x compression and maintains 60% accuracy at 20x compression, significantly reducing computational costs [7][18].
- DeepSeek-OCR includes a "deep parsing mode" that converts financial charts into structured data, facilitating the generation of editable analysis formats [6][18].

Competitive Landscape
- Doubao's success is attributed to broad audience targeting and integration with ByteDance's social platforms, making it more accessible to general users than DeepSeek's more technical offering [9][10][12].
- Doubao's branding and user experience are designed to appeal to a wide audience, in contrast to DeepSeek's more niche positioning [10][12].
- Despite being overtaken, DeepSeek maintains a significant user base and continues to focus on technical advances; its V3 series has a total parameter count of 671 billion [17][19].

Future Considerations
- DeepSeek's ability to leverage its large C-end user base and differentiate its ecosystem will be crucial for competing with Doubao [19].
- The release of DeepSeek-OCR may serve as a catalyst for model training and improve data-processing efficiency for future model iterations [18][19].
- Ongoing delays in the R2 model's development have eroded DeepSeek's competitive edge in the rapidly evolving AI landscape [8][15][19].
DeepSeek's New Model Is Wild: The Entire AI Community Is Studying the Visual Route, and Karpathy Isn't Holding Back
机器之心· 2025-10-21 03:43
Core Insights
- The article discusses the groundbreaking release of the DeepSeek-OCR model, which compresses 1,000 words into 100 visual tokens while maintaining a high accuracy of 97% [1].
- The model addresses the long-context efficiency problem in large language models (LLMs) and suggests a paradigm shift in which visual inputs may be more effective than textual inputs [1][5].

Group 1: Model Features and Performance
- DeepSeek-OCR can process 200,000 pages of data daily on a single NVIDIA A100 GPU [1].
- Its compression efficiency is ten times better than traditional text tokens, allowing a significant reduction in the number of tokens needed to represent information [9].
- The model eliminates the need for tokenizers, which have been criticized for their complexity and inefficiency [6].

Group 2: Community Reception and Expert Opinions
- The open-source release led to widespread validation and excitement in the AI community, with over 4,000 stars on GitHub shortly after launch [1][2].
- Experts such as Andrej Karpathy have praised the model, highlighting its potential to redefine how LLMs process inputs [3][5].
- The model has sparked discussion about the efficiency of visual tokens versus text tokens, with some researchers noting that visual representations may perform better in certain contexts [9][11].

Group 3: Implications for Future Research
- Visual tokens could significantly expand the effective context length of models, potentially allowing extensive internal documents to be integrated into prompts [12][13].
- Previous research laid the groundwork for similar concepts, indicating that while DeepSeek-OCR is innovative, it is part of a broader trend in the field [18][20].
- Combining DeepSeek-OCR with other recent advances, such as sparse attention mechanisms, is highlighted as a promising avenue for future exploration [11][12].
X @外汇交易员
外汇交易员· 2025-10-21 01:45
DeepSeek's newly open-sourced model, DeepSeek-OCR, has drawn attention from overseas developers. The model proposes "contextual optical compression": representing content that would otherwise require a large number of text tokens with a small number of visual tokens, thereby reducing the computational overhead of large models. The encoder, DeepEncoder, converts images into highly compressed visual tokens, and the decoder, DeepSeek3B-MoE-A570M, reconstructs the text from those visual tokens — achieving a lot with a little. Some developers have even exclaimed that the release of DeepSeek-OCR is "AI's JPEG moment." ...
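The encoder/decoder split described above can be illustrated with a toy pipeline. This is a conceptual sketch only — the function names and the naive chunk-packing "compression" are illustrative stand-ins for the real DeepEncoder and MoE decoder, not DeepSeek's actual method:

```python
from typing import List, Tuple

CHUNK = 10  # text tokens represented per "visual token" (10x compression)

def deep_encode(text_tokens: List[str], chunk: int = CHUNK) -> List[Tuple[str, ...]]:
    """Stand-in for DeepEncoder: pack each run of `chunk` text tokens
    into one 'visual token' (modeled here as a tuple)."""
    return [tuple(text_tokens[i:i + chunk])
            for i in range(0, len(text_tokens), chunk)]

def deep_decode(visual_tokens: List[Tuple[str, ...]]) -> List[str]:
    """Stand-in for the decoder: reconstruct text tokens from visual tokens."""
    return [t for vt in visual_tokens for t in vt]

tokens = [f"w{i}" for i in range(1000)]
vis = deep_encode(tokens)
print(len(vis))  # 100 visual tokens for 1000 text tokens
assert deep_decode(vis) == tokens  # lossless round trip in this toy
```

The real system is lossy above certain compression ratios (97% accuracy at 10x, 60% at 20x, per the summaries here); this toy only demonstrates the token-count arithmetic of "a few visual tokens stand in for many text tokens."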
The Newly Open-Sourced DeepSeek-OCR May Be the Most Pleasantly Surprising Model in Recent Memory
数字生命卡兹克· 2025-10-21 01:32
Core Insights
- The article introduces DeepSeek-OCR, a new model that extends traditional Optical Character Recognition (OCR) by not only extracting text but also generating structured documents and compressing information effectively [1][3][5].

Group 1: Traditional OCR vs. DeepSeek-OCR
- Traditional OCR primarily converts images of text into editable digital text, which can be cumbersome for complex documents such as financial reports [3][5].
- DeepSeek-OCR goes further by generating Markdown documents that preserve the structure of the original content, including text, titles, and charts, making it far more versatile [5][6].

Group 2: Contextual Compression
- DeepSeek-OCR introduces a novel approach called "Contextual Optical Compression," which lets the model process long texts more efficiently by converting them into images instead of tokenized text [18][19].
- This method significantly reduces the computational load of processing long texts, since the cost of attending over tokens grows quadratically with text length [8][10][11].

Group 3: Performance Metrics
- The model achieves a compression ratio of up to 10x while maintaining a recognition accuracy of 96.5% [23].
- The compression ratio is calculated by dividing the total number of original text tokens by the number of visual tokens after compression [24].

Group 4: Implications for AI and Memory
- DeepSeek-OCR's approach mirrors human memory: recent information is retained with high fidelity while older information gradually fades [39][40].
- This "forgetting" mechanism is presented as a potential advantage for AI, allowing it to prioritize important information and manage memory more like humans do [40][41].
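The compression-ratio definition in Group 3 is straightforward arithmetic, shown here as a one-liner:

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Compression ratio = original text tokens / visual tokens after compression."""
    return text_tokens / vision_tokens

# The headline figure quoted in these articles: 1000 text tokens
# represented by 100 visual tokens gives roughly 10x compression.
print(compression_ratio(1000, 100))  # 10.0
```

At this ~10x ratio the model reportedly keeps 96.5-97% accuracy; pushing to 20x (e.g., 2,000 text tokens into 100 visual tokens) reportedly drops accuracy to around 60%.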
DeepSeek's New Model Is Being Praised Wildly in Silicon Valley! Compressing 1D Text with 2D Vision, Runs on a Single GPU — "Google's Core Secret Has Been Open-Sourced"
Hua Er Jie Jian Wen· 2025-10-21 00:27
Core Insights
- DeepSeek has released an open-source model named DeepSeek-OCR, which is drawing significant attention in Silicon Valley for its innovative approach to processing long texts via visual compression [1][4][21].
- The model is designed to tackle the computational challenges large models face with lengthy text, achieving high accuracy even with greatly reduced token usage [1][4][5].

Model Performance
- DeepSeek-OCR has 3 billion parameters and decodes text with 97% accuracy at compression ratios below 10x, maintaining 60% accuracy even at a 20x compression ratio [1][4][5].
- Benchmarked against existing models, it shows superior performance with far fewer visual tokens, e.g., using only 100 visual tokens to outperform models that require 256 [7][8].

Data Generation Efficiency
- The model can generate over 200,000 pages of high-quality training data daily on a single A100-40G GPU [2][4].

Innovative Approach
- DeepSeek introduces a concept called "Contextual Optical Compression," which compresses textual information into visual form, allowing the model to interpret content through images rather than text [4][10].
- The architecture has two main components: DeepEncoder, which converts images into compressed visual tokens, and DeepSeek3B-MoE-A570M, which reconstructs text from those tokens [10][11].

Flexibility and Adaptability
- DeepEncoder handles a range of input resolutions and token counts, adapting to different compression needs and application scenarios [11][12].
- The model supports complex image analysis, including financial reports and scientific diagrams, enhancing its applicability across diverse fields [12][14].

Future Implications
- The research suggests this unified approach to visual and textual processing could be a step toward Artificial General Intelligence (AGI) [4][21].
- The team behind DeepSeek-OCR is exploring the potential of simulating human memory mechanisms through optical compression, which could enable more efficient handling of long-term context in AI [20][21].
DeepSeek's New Model Is Being Praised Wildly in Silicon Valley! Compressing 1D Text with 2D Vision, Runs on a Single GPU — "Google's Core Secret Has Been Open-Sourced"
量子位· 2025-10-20 23:34
Core Insights
- DeepSeek has released a groundbreaking open-source model, DeepSeek-OCR, which is drawing significant attention in Silicon Valley for its innovative, highly efficient approach to processing long texts [1][3][7].

Model Overview
- DeepSeek-OCR addresses the computational challenges of handling long texts by compressing textual information into visual tokens, thereby reducing the number of tokens needed for processing [5][12][13].
- It achieves a decoding accuracy of 97% when the compression ratio is below 10x, and around 60% even at a 20x compression ratio [6].

Performance Metrics
- DeepSeek-OCR achieves state-of-the-art (SOTA) results on the OmniDocBench benchmark with far fewer visual tokens than existing models [14][15].
- For instance, using only 100 visual tokens, it outperforms GOT-OCR2.0, which uses 256 tokens, and matches the performance of other models while using far fewer tokens [17].

Technical Components
- The architecture consists of two main components: DeepEncoder, which converts high-resolution images into highly compressed visual tokens, and the DeepSeek3B-MoE-A570M decoder, which reconstructs text from those tokens [20][22].
- The model supports multiple input modes, adapting its compression strength to the specific task requirements [24].

Innovative Concepts
- The research introduces the concept of "Contextual Optical Compression," which simulates human memory mechanisms by dynamically allocating computational resources based on the temporal context of the information being processed [36][38].
- This approach aims to improve the model's handling of long conversations or documents, potentially leading to a more human-like memory structure in AI systems [39][41].
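The "human-like forgetting" idea above — older context re-rendered at progressively lower resolution, hence fewer visual tokens — can be sketched as a tiering schedule. The tier boundaries and token budgets below are invented purely for illustration; the papers summarized here do not publish this schedule:

```python
def token_budget_by_age(age_in_turns: int) -> int:
    """Hypothetical schedule: recent context keeps full resolution
    (more visual tokens); older context is downscaled (fewer tokens),
    so fine detail is gradually 'forgotten'. Numbers are illustrative."""
    if age_in_turns < 5:
        return 256   # fresh context: high resolution
    if age_in_turns < 20:
        return 100   # medium age: compressed
    return 64        # old context: heavily compressed

print([token_budget_by_age(a) for a in (0, 10, 50)])  # [256, 100, 64]
```

The open question raised in these articles is whether such lossy tiering is actually desirable in a digital system, where perfect retention is cheap, or whether it mainly serves to bound compute over very long contexts.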