Today's hot take: DeepSeek-OCR crushes every other architecture
自动驾驶之心· 2025-10-27 00:03
Core Viewpoint
- DeepSeek has introduced a new model, DeepSeek-OCR, which significantly reduces the number of tokens required to store and process information by using images as memory carriers instead of relying solely on text tokens [3][6][12].

Group 1: Model Capabilities
- DeepSeek-OCR can store nearly the same amount of information using only one-tenth of the tokens required by traditional models [40][41].
- In tests, DeepSeek-OCR needed only 100 visual tokens to surpass GOT-OCR 2.0, which requires 256 tokens, and fewer than 800 visual tokens to outperform MinerU 2.0, which typically requires over 6,000 tokens [13][14].
- The model supports various resolutions and compression modes, adapting to document complexity; simple documents, for example, need only 64 visual tokens [18][21].

Group 2: Data Collection and Utilization
- DeepSeek-OCR can capture previously uncollected data from two-dimensional information, such as graphs and images in academic papers, which traditional models could not interpret [32][33].
- The model can generate over 200,000 pages of training data per day on a single A100 GPU, indicating its efficiency in data collection [35].

Group 3: Resource Efficiency
- By using images as memory, DeepSeek-OCR reduces the computational load, allowing a significant decrease in token usage without sacrificing performance [40][41].
- The model maintains 96.5% accuracy while using only one-tenth of the original token count, demonstrating its effectiveness in resource management [41][42].

Group 4: Open Source and Community Contributions
- DeepSeek-OCR builds on various open-source resources, including Huawei's Wukong dataset and Meta's SAM for image feature extraction [51][53].
- Integrating multiple open-source models has enabled DeepSeek to create an AI capable of "thinking in images," showcasing the power of community-driven innovation [53].
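The token figures quoted in the summaries reduce to simple ratios; a quick arithmetic check (the token counts are the article's, the helper itself is just illustrative division):

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ratio of text tokens to the visual tokens that replace them."""
    return text_tokens / vision_tokens

# GOT-OCR 2.0 needs 256 tokens where DeepSeek-OCR uses 100 visual tokens.
print(compression_ratio(256, 100))   # 2.56

# MinerU 2.0 needs 6,000+ tokens where DeepSeek-OCR stays under 800.
print(compression_ratio(6000, 800))  # 7.5

# The headline claim: one-tenth of the tokens at 96.5% accuracy.
print(compression_ratio(1000, 100))  # 10.0
```

Note that the one-tenth figure is the reported sweet spot; per the third summary, pushing compression to 20x drops accuracy to around 60%.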
DeepSeek's newly open-sourced model is a bit uncanny
创业邦· 2025-10-25 10:14
Core Viewpoint
- DeepSeek has introduced a new model, DeepSeek-OCR, which uses images to store information instead of relying solely on text tokens, significantly improving data compression and model efficiency [5][11][26].

Group 1: Model Functionality
- DeepSeek-OCR can convert large amounts of text into images that serve as a memory carrier for AI, allowing more efficient data storage [9][14].
- The model outperforms traditional models while using fewer visual tokens, achieving better results with less resource consumption [11][26].
- In tests, DeepSeek-OCR used only 100 visual tokens to outperform GOT-OCR 2.0, which required 256 tokens, and fewer than 800 visual tokens against the 6,000+ used by MinerU 2.0 [11][14].

Group 2: Data Collection and Utilization
- The model can capture previously uncollected data from two-dimensional information, such as graphs and images in academic papers, which traditional models could not interpret [22][24].
- DeepSeek-OCR can generate over 200,000 pages of training data per day on a single A100 GPU, pointing to richer training datasets for future models [24].
- Because the model remembers the position of images and the surrounding text, it builds a more complete understanding of the data [18][22].

Group 3: Resource Efficiency
- With image-based memory, DeepSeek-OCR cuts the required token count to one-tenth of the original while maintaining 96.5% accuracy [26][27].
- The model's design allows token usage to adjust dynamically to document complexity, optimizing resource allocation [14][15].
- Even at 20-fold compression, the model retains around 60% accuracy, showing its robustness [27].

Group 4: Open Source Collaboration
- DeepSeek-OCR is an open-source project that integrates contributions from global open-source communities, drawing on datasets and models from companies such as Huawei, Baidu, Meta, and OpenAI [32][34].
- This collaborative effort produced a model capable of "thinking in images," highlighting the importance of community-driven innovation in AI development [34].
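The dynamic adjustment described above (Tiny through Gundam modes, token budgets scaled to document complexity) can be sketched as a budget-selection rule. The mode names come from the articles; the per-mode visual-token budgets below are illustrative assumptions, anchored only by the quoted figures (64 tokens for simple documents, ~100 and ~800 tokens in the benchmark comparisons):

```python
# Hypothetical visual-token budgets per mode; only "tiny"=64 is stated
# in the articles, the rest are assumptions for illustration.
MODES = {"tiny": 64, "small": 100, "base": 256, "large": 400, "gundam": 800}

def pick_mode(estimated_text_tokens: int, target_ratio: float = 10.0):
    """Choose the cheapest mode that keeps the text-to-vision compression
    ratio at or below target_ratio (10x is the articles' headline figure
    for ~96.5% accuracy)."""
    for name, budget in sorted(MODES.items(), key=lambda kv: kv[1]):
        if estimated_text_tokens / budget <= target_ratio:
            return name, budget
    # Fall back to the largest mode; accuracy degrades as the ratio grows.
    return "gundam", MODES["gundam"]

print(pick_mode(600))    # a simple page fits the smallest budget
print(pick_mode(10000))  # a dense page needs the largest mode
```

The design choice this illustrates: rather than one fixed image resolution, the encoder spends visual tokens in proportion to how much text the page actually carries.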
The new model DeepSeek open-sourced yesterday is a bit uncanny
36Kr· 2025-10-22 01:00
Core Insights
- DeepSeek has introduced a new model called DeepSeek-OCR, which can compress text information into images, achieving a significant reduction in token usage while maintaining high accuracy [5][31][39].

Group 1: Model Capabilities
- DeepSeek-OCR can store large amounts of text as images, representing information more efficiently than traditional text-based models [9][10].
- The model needs only 100 visual tokens to outperform previous models that required 256, and fewer than 800 visual tokens versus the 6,000+ used by other models [14][31].
- DeepSeek-OCR supports various resolutions and compression modes, from Tiny to Gundam, adjusting dynamically to document complexity [17][18].

Group 2: Data Utilization
- The model can capture previously unutilized data from documents, such as graphs and images, which traditional models could not interpret effectively [24][26].
- DeepSeek-OCR can generate over 200,000 pages of training data per day on a single A100 GPU, pointing to richer training datasets for future models [29].
- By using image memory, the model reduces the computational load significantly, allowing longer conversations to be processed without a proportional increase in resource consumption [31].

Group 3: Open Source Collaboration
- The development of DeepSeek-OCR is a collaborative effort, integrating various open-source resources, including Huawei's Wukong dataset and Meta's SAM for image feature extraction [38][39].
- The model's architecture reflects a collective achievement of the open-source community, showcasing the potential of collaborative innovation in AI development [39].
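The "longer conversations without proportional cost" point above can be made concrete with a back-of-the-envelope model: suppose older turns are rendered to images and held at the articles' 10x compression, while recent turns stay as raw text. The function name, the fixed window of two recent turns, and the per-turn token counts are all illustrative assumptions:

```python
def context_cost(turn_token_counts, recent_window=2, compression=10):
    """Estimate total context tokens when all but the most recent turns
    are stored as compressed visual tokens (assumed 10x compression,
    per the articles' headline figure)."""
    recent = turn_token_counts[-recent_window:]   # kept as raw text
    older = turn_token_counts[:-recent_window]    # held as image memory
    return sum(recent) + sum(t // compression for t in older)

turns = [500, 800, 600, 700, 400]  # hypothetical tokens per turn
print(context_cost(turns))  # 1290, versus 3000 if every turn stayed as text
```

The context window still grows with conversation length, but the older turns contribute at a tenth of their raw cost, which is the resource-efficiency claim in Group 2.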