Today's hot take: DeepSeek-OCR blows away every architecture
自动驾驶之心·2025-10-27 00:03

Core Viewpoint

DeepSeek has introduced a new model, DeepSeek-OCR, which significantly reduces the number of tokens required to store and process information by using images as memory carriers instead of relying solely on text tokens [3][6][12].

Group 1: Model Capabilities

- DeepSeek-OCR can store nearly the same amount of information using only one-tenth of the tokens required by traditional models [40][41].
- In tests, DeepSeek-OCR outperformed GOT-OCR 2.0 using only 100 visual tokens against its 256, and surpassed MinerU 2.0, which typically requires over 6,000 tokens, with fewer than 800 visual tokens [13][14].
- The model supports multiple resolutions and compression modes, adapting to documents of varying complexity; a simple document can be handled with as few as 64 visual tokens [18][21].

Group 2: Data Collection and Utilization

- DeepSeek-OCR can capture previously uncollected data from two-dimensional information, such as graphs and images in academic papers, which text-only models could not interpret [32][33].
- The model can generate over 200,000 pages of training data per day on an A100 GPU, indicating its efficiency in data collection [35].

Group 3: Resource Efficiency

- By using images as memory, DeepSeek-OCR reduces the computational load, allowing a significant decrease in token usage without sacrificing performance [40][41].
- The model maintains 96.5% accuracy while using only one-tenth of the original token count, demonstrating its effectiveness in resource management [41][42].

Group 4: Open Source and Community Contributions

- The development of DeepSeek-OCR is a collaborative effort, drawing on various open-source resources, including Huawei's Wukong dataset and Meta's SAM for image feature extraction [51][53].
- The integration of multiple open-source models enabled DeepSeek to create an AI capable of "thinking in images," showcasing the power of community-driven innovation [53].
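The compression figures quoted above can be sanity-checked with a quick calculation. A minimal sketch: the token counts come from the summary itself, while the `compression_ratio` helper is purely illustrative and not part of DeepSeek-OCR.

```python
# Sanity-check the token-compression ratios quoted in the summary.
# All figures come from the article; this helper is illustrative only.

def compression_ratio(baseline_tokens: int, visual_tokens: int) -> float:
    """How many baseline text tokens each visual token stands in for."""
    return baseline_tokens / visual_tokens

# vs. GOT-OCR 2.0: 100 visual tokens instead of 256 text tokens
print(f"vs GOT-OCR 2.0: {compression_ratio(256, 100):.2f}x")   # 2.56x

# vs. MinerU 2.0: fewer than 800 visual tokens instead of 6,000+
print(f"vs MinerU 2.0: >{compression_ratio(6000, 800):.1f}x")  # >7.5x

# Headline claim: one-tenth the tokens while keeping 96.5% accuracy
print(f"headline: {compression_ratio(10, 1):.0f}x at 96.5% accuracy")
```

These ratios show why the headline "one-tenth" figure is plausible for dense documents: the measured per-benchmark ratios already range from roughly 2.5x to 7.5x, and the simplest documents (64 visual tokens) compress further still.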