Using Images Instead of Text as LLM Input
DeepSeek OCR: The Real Aim Lies Elsewhere
Founder Park · 2025-10-21 07:46
Core Viewpoint
- DeepSeek-OCR is a new AI model that handles the text in an image as visual data, compressing it about 10 times while maintaining a recognition accuracy of 96.5% [7][11].

Group 1: Model Performance and Innovation
- DeepSeek-OCR can compress a 1000-word article into just 100 visual tokens [7].
- The model offers multiple resolution options, requiring as few as 64 tokens for a 512 x 512 image and 256 tokens for a 1024 x 1024 image [13].
- Using visual tokens for text recognition is not an entirely new idea, but the model is a significant step toward productization and practical application [13][14].

Group 2: Industry Reactions and Future Directions
- Notable figures in the AI community, such as Karpathy, have expressed interest in the model and suggested that future large language models (LLMs) might benefit from image-based inputs instead of traditional text [11][15].
- DeepSeek-OCR could improve the processing of mixed media (text, images, tables) in various applications, a task that current visual models struggle with [15].
- Simulating a forgetting mechanism by lowering image resolution is an intriguing idea, but it is unclear whether a mechanism modeled on human cognition carries over to digital systems [15].
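
The token counts quoted under Group 1 can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the image is cut into 16 x 16 pixel patches and that the patch tokens are then reduced by a further factor of 16. Neither value is stated in this summary; they are chosen because they reproduce both quoted figures. The function names are hypothetical and not part of any DeepSeek API.

```python
# Assumptions (not stated in the summary): 16-pixel square patches and a
# 16x reduction of patch tokens before they reach the language model.
PATCH_SIZE = 16
TOKEN_DOWNSAMPLE = 16


def vision_tokens(width: int, height: int) -> int:
    """Estimate the visual token count for an image under the assumed scheme."""
    patches = (width // PATCH_SIZE) * (height // PATCH_SIZE)
    return patches // TOKEN_DOWNSAMPLE


def compression_ratio(text_units: int, visual_tokens: int) -> float:
    """Ratio of text units (words or text tokens) to visual tokens."""
    return text_units / visual_tokens


if __name__ == "__main__":
    for side in (512, 1024):
        print(f"{side} x {side}: {vision_tokens(side, side)} visual tokens")
    # The summary's example: a 1000-word article in 100 visual tokens.
    print(f"compression: {compression_ratio(1000, 100):.0f}x")
```

Under these assumptions a 512 x 512 image yields 64 tokens and a 1024 x 1024 image yields 256, matching the figures above, and the 1000-to-100 example corresponds to the stated 10x compression.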