DeepSeek's New Model Has Silicon Valley Raving!
华尔街见闻·2025-10-21 10:13

Core Viewpoint
- DeepSeek has introduced DeepSeek-OCR, a model built around "contextual optical compression": long texts are rendered into images and compressed into a small number of visual tokens, sharply reducing computational cost while maintaining high accuracy in document parsing [5][13][14].

Summary by Sections

Model Overview
- DeepSeek-OCR targets the computational cost of processing long texts. It reaches roughly 97% accuracy when the compression ratio stays below 10x, and still holds around 60% accuracy at 20x compression [6][15].
- The release drew immediate attention, quickly accumulating 3.3K stars on GitHub and ranking second on HuggingFace's trending list [7].

Technical Innovations
- The model has two core components: DeepEncoder, which converts images into highly compressed visual tokens, and a DeepSeek3B-MoE-A570M decoder, which reconstructs the text from those tokens [19][20].
- DeepEncoder uses a serial design that processes high-resolution images in three stages, local feature extraction, token compression, and global understanding, so that it emits only a small number of information-dense visual tokens (a rough sketch of this stage-by-stage pipeline appears after this summary) [21][22].

Performance Metrics
- With only 100 visual tokens per page, DeepSeek-OCR surpasses GOT-OCR2.0, which uses 256 tokens per page [18][19].
- Several input modes let the model match its compression level to the task, ranging from "Tiny" (64 tokens) up to "Gundam" (up to 800 tokens); the token-budget arithmetic is illustrated in the first sketch below [23][25].

Future Implications
- The authors suggest that unifying visual and textual processing in this way may be one pathway toward Artificial General Intelligence (AGI) [11].
- The team also proposes using optical compression to mimic human memory's forgetting mechanism: older context could be re-rendered at progressively lower resolution, letting the model allocate computational resources according to how recent the information is (see the final sketch below) [34][37][38].
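
To make the compression figures above concrete, here is a small back-of-the-envelope sketch in Python. The Tiny and Gundam token budgets and the 100-token comparison against GOT-OCR2.0 come from the article; the mode name "Small", the assumed 1,000-text-token page, and everything else below are illustrative assumptions, not DeepSeek specifications.

```python
# Back-of-the-envelope arithmetic for "contextual optical compression".
# Token budgets for Tiny (64) and Gundam (up to 800) are quoted in the article;
# the 1,000-token page density used below is an illustrative assumption.

MODE_TOKENS = {
    "Tiny": 64,      # smallest mode quoted in the article
    "Small": 100,    # the 100-token setting the article compares against GOT-OCR2.0
    "Gundam": 800,   # upper bound quoted in the article
}

def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each visual token stands in for."""
    return text_tokens / vision_tokens

if __name__ == "__main__":
    text_tokens_per_page = 1000  # assumed density of a full page of text
    for mode, budget in MODE_TOKENS.items():
        ratio = compression_ratio(text_tokens_per_page, budget)
        print(f"{mode:>6}: {budget:>3} visual tokens ~ {ratio:.1f}x compression")
```

At this assumed page density, the 100-token setting sits right around the 10x regime where the article reports roughly 97% accuracy, while Tiny pushes past 15x, where accuracy would be expected to degrade.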
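The serial three-stage design described under Technical Innovations can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions: the layer types, dimensions, and the 16x sequence-compression factor are placeholders chosen so the shapes work out, not the released DeepEncoder architecture, and plain full-attention layers stand in for the local and global stages.

```python
import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    """Illustrative three-stage encoder: local features, token compression, global attention."""

    def __init__(self, dim: int = 768, compress: int = 16):
        super().__init__()
        # Stage 1: patch embedding plus a local feature-extraction block
        # (a plain attention layer here; the real stage is more memory-frugal).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.local_block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # Stage 2: token compression. A strided 1-D convolution shrinks the
        # token sequence by `compress` before any global attention is applied.
        self.compressor = nn.Conv1d(dim, dim, kernel_size=compress, stride=compress)
        # Stage 3: global understanding over the much smaller token set.
        self.global_block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(image)                  # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)             # (B, N, dim) patch tokens, N large
        x = self.local_block(x)                      # local features on the long sequence
        x = self.compressor(x.transpose(1, 2)).transpose(1, 2)  # (B, N/compress, dim)
        return self.global_block(x)                  # few, information-dense visual tokens

# A 640x640 page yields 1600 patch tokens, compressed here to 100 before global
# attention runs, which is why activation cost stays manageable at high resolution.
visual_tokens = DeepEncoderSketch()(torch.randn(1, 3, 640, 640))
print(visual_tokens.shape)  # torch.Size([1, 100, 768])
```

The point of the serial ordering is that the expensive global-attention stage only ever sees the compressed sequence, not the thousands of raw patch tokens.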
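Finally, the forgetting mechanism mentioned under Future Implications can be sketched as a resolution schedule: context that lies further in the past is re-rendered at lower resolution and therefore occupies fewer visual tokens. Everything here (the decay rate, the 64-pixel floor, and the one-token-per-64x64-tile budget) is an illustrative assumption, not a description of DeepSeek's implementation.

```python
from PIL import Image

def downscale_for_age(page: Image.Image, age: int, decay: float = 0.7,
                      min_side: int = 64) -> Image.Image:
    """Shrink a rendered context page by `decay` per step of age, down to a floor."""
    scale = max(decay ** age, min_side / max(page.size))
    w, h = page.size
    return page.resize((max(int(w * scale), 1), max(int(h * scale), 1)))

# The same 1024x1024 page of rendered history, re-encoded at increasing "age".
page = Image.new("RGB", (1024, 1024), "white")
for age in range(5):
    shrunk = downscale_for_age(page, age)
    # Assume roughly one visual token per 64x64 tile as a crude budget estimate.
    token_budget = max((shrunk.width // 64) * (shrunk.height // 64), 1)
    print(f"age {age}: {shrunk.size} -> ~{token_budget} visual tokens")
```

Recent context keeps its full token budget while older turns fade toward a handful of tokens, which is the kind of dynamic, recency-aware allocation of compute the team describes.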