PaddleOCR
Today's Hot Take: DeepSeek-OCR Beats Every Other Architecture
自动驾驶之心· 2025-10-27 00:03
Core Viewpoint
- DeepSeek has introduced a new model, DeepSeek-OCR, which significantly reduces the number of tokens required to store and process information by using images as memory carriers instead of relying solely on text tokens [3][6][12].
Group 1: Model Capabilities
- DeepSeek-OCR can store nearly the same amount of information using only one-tenth of the tokens that traditional models need [40][41].
- In tests, DeepSeek-OCR used only 100 visual tokens to surpass GOT-OCR 2.0, which requires 256 tokens, and fewer than 800 visual tokens to outperform MinerU 2.0, which typically requires over 6,000 tokens [13][14].
- The model supports several resolutions and compression modes, so it can adapt to document complexity; a simple document, for example, needs only 64 visual tokens [18][21].
Group 2: Data Collection and Utilization
- DeepSeek-OCR can capture data that previously went uncollected: two-dimensional information, such as graphs and images in academic papers, that traditional models could not interpret [32][33].
- The model can generate over 200,000 pages of training data per day on an A100 GPU, indicating its efficiency in data collection [35].
Group 3: Resource Efficiency
- By using images for memory, DeepSeek-OCR reduces the computational load, allowing a significant decrease in token usage without sacrificing performance [40][41].
- The model maintains 96.5% accuracy while using only one-tenth of the original token count, demonstrating its effectiveness in resource management [41][42].
Group 4: Open Source and Community Contributions
- DeepSeek-OCR builds on a range of open-source resources, including Huawei's Wukong dataset and Meta's SAM for image feature extraction [51][53].
- Integrating multiple open-source models has enabled DeepSeek to create an AI capable of "thinking in images," showcasing the power of community-driven innovation [53].
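To put the token counts quoted above in perspective, the following minimal Python sketch computes the reduction factors they imply. The numbers are taken from the summary; the function name `reduction_factor` is purely illustrative. Because the summary gives only bounds for the MinerU 2.0 comparison ("fewer than 800" versus "over 6,000"), that ratio should be read as a rough lower bound rather than a measured value.

```python
def reduction_factor(baseline_tokens: int, visual_tokens: int) -> float:
    """Return how many times fewer tokens the visual encoding uses."""
    if visual_tokens <= 0:
        raise ValueError("visual_tokens must be positive")
    return baseline_tokens / visual_tokens


# Per-page token counts quoted in the summary.
vs_got_ocr = reduction_factor(256, 100)   # GOT-OCR 2.0 vs. DeepSeek-OCR
vs_mineru = reduction_factor(6000, 800)   # MinerU 2.0 vs. DeepSeek-OCR (bounds only)

print(f"vs GOT-OCR 2.0: {vs_got_ocr:.2f}x fewer tokens")
print(f"vs MinerU 2.0: at least {vs_mineru:.1f}x fewer tokens")
```

Under these figures the saving is about 2.56x against GOT-OCR 2.0 and at least 7.5x against MinerU 2.0.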
DeepSeek's Newly Open-Sourced Model Is a Bit Uncanny
创业邦· 2025-10-25 10:14
Core Viewpoint
- DeepSeek has introduced a new model, DeepSeek-OCR, which stores information in images instead of relying solely on text tokens, significantly improving data compression and model efficiency [5][11][26].
Group 1: Model Functionality
- DeepSeek-OCR can convert large amounts of text into images that serve as a memory carrier for AI, allowing more efficient data storage [9][14].
- The model achieves better results with fewer visual tokens, and therefore lower resource consumption, than traditional models [11][26].
- In tests, DeepSeek-OCR used only 100 visual tokens to outperform GOT-OCR 2.0, which required 256 tokens, and fewer than 800 visual tokens to outperform MinerU 2.0, which required over 6,000 tokens [11][14].
Group 2: Data Collection and Utilization
- The model can capture data that previously went uncollected: two-dimensional information, such as graphs and images in academic papers, that traditional models could not interpret [22][24].
- DeepSeek-OCR can generate over 200,000 pages of training data per day on an A100 GPU, indicating its potential to enlarge the training datasets of future models [24].
- Because the model remembers the position of images and the text surrounding them, it gains a more comprehensive understanding of the data [18][22].
Group 3: Resource Efficiency
- By using image-based memory, DeepSeek-OCR can reduce the number of tokens required to one-tenth of the original while maintaining a high accuracy rate of 96.5% [26][27].
- The model adjusts its token usage dynamically according to the complexity of the document, optimizing resource allocation [14][15].
- The research indicates that even at 20-fold compression the model retains around 60% accuracy, showcasing its robustness [27].
Group 4: Open Source Collaboration
- DeepSeek-OCR is an open-source project that integrates contributions from open-source communities worldwide, using datasets and models from companies such as Huawei, Baidu, Meta, and OpenAI [32][34].
- This collaborative effort has resulted in a model capable of "thinking in images," highlighting the importance of community-driven innovation in AI development [34].
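The trade-off between compression and accuracy described above can be sketched as a small lookup. This is an illustration only: it uses just the two operating points quoted in the summary (10-fold at 96.5%, 20-fold at roughly 60%), and the names `REPORTED_POINTS`, `visual_token_budget`, and `max_compression` are hypothetical helpers, not part of DeepSeek-OCR.

```python
from typing import Optional

# (compression ratio, approximate accuracy) pairs quoted in the summary.
REPORTED_POINTS = [(10, 0.965), (20, 0.60)]


def visual_token_budget(text_tokens: int, compression: int) -> int:
    """Visual tokens needed to hold `text_tokens` at a given ratio, rounded up."""
    if compression < 1:
        raise ValueError("compression must be at least 1")
    return -(-text_tokens // compression)  # ceiling division


def max_compression(min_accuracy: float) -> Optional[int]:
    """Highest reported ratio whose accuracy meets `min_accuracy`, else None."""
    candidates = [ratio for ratio, acc in REPORTED_POINTS if acc >= min_accuracy]
    return max(candidates) if candidates else None


ratio = max_compression(0.95)
print(ratio, visual_token_budget(10_000, ratio))  # prints: 10 1000
```

With a 95% accuracy floor only the 10-fold point qualifies, so 10,000 text tokens would map to about 1,000 visual tokens; relaxing the floor to 50% admits the 20-fold point.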
The Model DeepSeek Open-Sourced Yesterday Is a Bit Uncanny
36Kr · 2025-10-22 01:00
Core Insights
- DeepSeek has introduced a new model called DeepSeek-OCR, which compresses text information into images, achieving a significant reduction in token usage while maintaining high accuracy [5][31][39].
Group 1: Model Capabilities
- DeepSeek-OCR can store large amounts of text as images, a more efficient representation of information than that of traditional text-based models [9][10].
- The model needs only 100 visual tokens to outperform previous models that required 256 tokens, and fewer than 800 visual tokens where other models use over 6,000 [14][31].
- DeepSeek-OCR supports several resolutions and compression modes, ranging from Tiny to Gundam, so token usage can be adjusted dynamically to the complexity of the document [17][18].
Group 2: Data Utilization
- The model can capture previously unused data in documents, such as graphs and images, that traditional models could not interpret effectively [24][26].
- DeepSeek-OCR can generate over 200,000 pages of training data per day on an A100 GPU, indicating its potential to enlarge the training datasets of future models [29].
- By using image memory, the model significantly reduces the computational load, so longer conversations can be processed without a proportional increase in resource consumption [31].
Group 3: Open Source Collaboration
- DeepSeek-OCR builds on a range of open-source resources, including Huawei's Wukong dataset and Meta's SAM for image feature extraction [38][39].
- The model's architecture reflects a collective achievement of the open-source community, showcasing the potential of collaborative innovation in AI development [39].
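The throughput figure above (over 200,000 pages per day on one A100 GPU) is easier to reason about as a per-second rate or as the time needed for a whole corpus. The sketch below does that arithmetic. The corpus size and GPU count in the example are made up, and the linear scaling across GPUs is an assumption for illustration, not something the source reports.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PAGES_PER_DAY_PER_GPU = 200_000  # lower bound quoted in the summary (one A100)


def pages_per_second(pages_per_day: int = PAGES_PER_DAY_PER_GPU) -> float:
    """Convert a daily page count into an average per-second rate."""
    return pages_per_day / SECONDS_PER_DAY


def days_to_process(total_pages: int, gpus: int = 1) -> float:
    """Days needed for a corpus, ASSUMING perfect linear scaling across GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return total_pages / (PAGES_PER_DAY_PER_GPU * gpus)


print(f"{pages_per_second():.2f} pages/s on one GPU")
# Hypothetical workload: 10 million pages spread over 20 GPUs.
print(f"{days_to_process(10_000_000, gpus=20):.1f} days")
```

At the quoted rate a single GPU averages a little over two pages per second, and the hypothetical 10-million-page corpus would take about two and a half days on 20 GPUs if scaling were ideal.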
Baidu's PaddleOCR Surpasses 9 Million Cumulative Downloads
Xin Lang Cai Jing· 2025-09-18 09:06
Core Insights
- Baidu has introduced its latest lightweight text recognition model, PP-OCRv5, which has only 0.07 billion parameters and achieves OCR accuracy comparable to that of models with 70 billion parameters, using just one-thousandth of the parameter count [1].
Summary by Categories
Product Development
- The PP-OCRv5 model represents a significant advancement in OCR technology, showcasing high efficiency with minimal parameters [1].
Market Impact
- Since its open-source launch in 2020, PaddleOCR has surpassed 9 million downloads and has been used directly or indirectly by over 5,900 open-source projects, highlighting its widespread adoption [1].
- PaddleOCR is the only Chinese OCR project on GitHub with over 50,000 stars, indicating strong community support and recognition [1].
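The "one-thousandth" claim above can be checked directly from the two parameter counts given in the summary. The short sketch below does only that arithmetic; the constant names are illustrative, and nothing here measures accuracy or uses PaddleOCR itself.

```python
PP_OCRV5_PARAMS = 70_000_000          # 0.07 billion, as stated in the summary
LARGE_MODEL_PARAMS = 70_000_000_000   # 70 billion, the comparison point

# Integer division is exact here, so the factor carries no rounding error.
size_factor = LARGE_MODEL_PARAMS // PP_OCRV5_PARAMS
fraction = PP_OCRV5_PARAMS / LARGE_MODEL_PARAMS

print(f"PP-OCRv5 is {size_factor}x smaller ({fraction:.3%} of the parameters)")
```

The two figures are consistent: 0.07 billion is exactly 1/1000 of 70 billion.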
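Led by Former OpenAI and DeepMind Researchers, 50+ Experts Discuss AI Coding, Agents, and Embodied Intelligence: Agenda of the 2025 Global Machine Learning Technology Conference Released!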
AI科技大本营· 2025-08-29 10:06
Core Insights
- The article emphasizes AI's transition from impressive demos to a rigorous focus on architecture, systems, data, and business integration, highlighting the need for sustainable industrial capabilities [1].
- The 2025 Global Machine Learning Technology Summit, organized by CSDN and Singularity Research Institute, will take place on October 16-17 in Beijing, featuring over 50 prominent speakers from academia and industry [1][3].
Group 1: Event Overview
- The summit aims to address the pressing question of how to turn technological breakthroughs into sustainable industrial capabilities [1].
- The organizers have designed a comprehensive "full-stack battle map" of AI covering 12 core topics, including the evolution of large language models, AI-enabled software development, and practical applications of large models [3][4].
Group 2: Key Speakers and Topics
- Zhao Jian will discuss AI safety and governance, focusing on the security risks and ethical challenges of large models, along with innovative governance solutions [5][8].
- Zhou Pan will present MindGPT-4o-Audio, a real-time voice dialogue model that achieves human-like interaction [11][14].
- Leng Dawei will share insights on FG-CLIP, a high-precision image-text alignment model designed for large-scale applications [16][19].
- Zhang Heng will explore the path from academic research to commercial AI visual algorithms, detailing the development process from prototype to product [20][24].
- Zhang Jun will introduce the Wenxin 4.5 open-source model and its key training technologies, addressing challenges in model training and inference [25][29].
- Zhang Dao Xin will discuss the application of multimodal models in Xiaohongshu's search functionality, focusing on content understanding and retrieval systems [30][33].
- Han Ai will present the OxyGent framework for multi-agent collaboration at JD Retail, emphasizing its modular design for flexible system development [34][37].
- Wang Peiyu will cover advancements in multimodal reasoning and unified models, showcasing the evolution of the r1v series [39][42].
- Cui Cheng will discuss the latest technologies in PaddleOCR and its applications in various industries [43][46].
- Xiao Chaojun will introduce MiniCPM, an efficient model for edge devices, highlighting breakthroughs in architecture and training algorithms [47][49].
- Chen Yingfeng will explore the application of embodied intelligence in engineering machinery, focusing on human-robot collaboration [50][53].
- Zhang Shaobo will present the role of LLM agents in software engineering, demonstrating their ability to solve real development challenges [54][57].
- Zhang Dan will discuss how large AI models can help overcome challenges in L4 autonomous driving, sharing insights on commercial applications [58][61].
- Han Zongbo will address uncertainty modeling in AI, providing a framework for enhancing reliability in complex scenarios [62][65].
Group 3: Future Directions
- The summit serves as a platform for in-depth exchange on AI technology, fostering collaboration and innovation across industries [74].
- The event aims to capture cutting-edge trends and explore pathways for industrial upgrading, inviting AI practitioners worldwide to join the discussions [74].