OCR Technology
Hunyuan OCR Model's Core Technology Revealed: A Unified Framework, Truly End-to-End
量子位· 2025-11-29 04:02
Contributed by the HunyuanOCR model team to 量子位 | 公众号 QbitAI. Tencent's Hunyuan large model team has officially released and open-sourced the HunyuanOCR model! This is a commercial-grade, open-source, and lightweight (1B-parameter) OCR-specialized vision-language model, built on an architecture that pairs a native ViT with a lightweight LLM. Specifically, its perception abilities (text detection and recognition, complex document parsing) surpass all public solutions, and its semantic abilities (information extraction, text-image translation) are also strong: it won the ICDAR 2025 DIMT challenge (small-model track) and achieved SOTA among sub-3B models on OCRBench. The model currently ranks in the top four on the Hugging Face trending list, has over 700 GitHub stars, and was integrated by the vLLM team on Day 0. According to the team, the Hunyuan OCR expert model delivers three major breakthroughs: (1) Versatility unified with efficiency. Within a lightweight framework it supports text detection and recognition, complex document parsing, open-field information extraction, visual question answering, and photo translation, addressing the pain points of single-function traditional expert models and inefficient general-purpose vision-understanding large models. (2) A minimal end-to-end architecture. By discarding dependencies on preprocessing such as layout analysis, it eliminates pipeline error accumulation and greatly simplifies deployment. HunyuanOCR consists of a native-resolution vision encoder, an adaptive MLP connector, and a lightweight ...
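The native-ViT + connector + LLM design described above follows a common vision-language-model pattern. As a minimal sketch of what an "adaptive MLP connector" does, not Hunyuan's actual implementation (all dimensions, weights, and names here are made up), a two-layer MLP simply projects vision-encoder patch features into the LLM's embedding space so each patch becomes one token the LLM can consume:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_connector(vit_features: np.ndarray,
                  w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Project vision-encoder patch features into the LLM embedding
    space with a two-layer MLP (a common VLM connector pattern)."""
    hidden = np.maximum(vit_features @ w1, 0.0)  # ReLU-style nonlinearity
    return hidden @ w2                           # shape: (n_patches, d_llm)

# Hypothetical sizes, chosen only to exercise the shapes.
n_patches, d_vit, d_hidden, d_llm = 256, 1024, 2048, 2048
vit_out = rng.standard_normal((n_patches, d_vit))
w1 = rng.standard_normal((d_vit, d_hidden)) * 0.02
w2 = rng.standard_normal((d_hidden, d_llm)) * 0.02

visual_tokens = mlp_connector(vit_out, w1, w2)
print(visual_tokens.shape)  # one projected token per image patch
```

The appeal of this design is that the connector is tiny compared to either encoder or LLM, which is how a 1B-parameter model can glue the two together cheaply.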
With only 0.9B parameters, PaddleOCR-VL is nonetheless the strongest OCR model available today.
数字生命卡兹克· 2025-10-23 01:33
Core Viewpoint
- The article highlights the significant advancements in the OCR (Optical Character Recognition) field, particularly the PaddleOCR-VL model developed by Baidu, which has achieved state-of-the-art (SOTA) performance in document parsing tasks [2][9][45].

Summary by Sections

Introduction to OCR Trends
- The term OCR has gained immense popularity in the AI community, especially with the emergence of DeepSeek-OCR, which has revitalized interest in the OCR sector [1][2].

Overview of PaddleOCR-VL
- PaddleOCR is not a new project; Baidu has developed it over several years, with origins dating back to 2020. It has since become the most popular open-source OCR project, currently leading in GitHub stars with 60K [6][7].
- The PaddleOCR-VL model is the latest addition to this series, marking the first time a large model has been applied to the core of OCR document parsing [9][11].

Performance Metrics
- PaddleOCR-VL, with only 0.9 billion parameters, achieved SOTA across all categories of the OmniDocBench v1.5 evaluation set, scoring 92.56 overall [11][12].
- In comparison, DeepSeek-OCR scored 86.46, meaning PaddleOCR-VL outperforms it by approximately 6 points [14][15].

Model Architecture and Efficiency
- PaddleOCR-VL employs a two-step architecture for efficiency: first, a traditional visual model (PP-DocLayoutV2) performs layout analysis, and then the PaddleOCR-VL model processes the smaller, boxed-out regions for text recognition [18][20].
- This approach lets PaddleOCR-VL achieve high accuracy without a larger model, demonstrating that effective solutions are often about problem decomposition rather than sheer size [16][20].

Practical Applications and Testing
- PaddleOCR-VL has shown impressive results in challenging scenarios, including scanned PDFs, handwritten notes, and complex layouts such as academic papers and invoices [22][28][34].
- The model's ability to accurately recognize and extract information from structured documents, such as tables, is a significant advantage for automating data-extraction processes [39][41].

Conclusion and Future Prospects
- PaddleOCR-VL is now open-source, allowing users to deploy it locally or try it through various demo platforms [44][45].
- The advancements made by both PaddleOCR-VL and DeepSeek-OCR are significant contributions to the OCR field, each excelling in its respective area [45][46].
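The two-step architecture above can be sketched abstractly. This is a hypothetical illustration of the pipeline shape, not PaddleOCR's API: `Region`, `parse_document`, and the stand-in models are all invented names. The key property is that recognition errors stay local to one detected region instead of cascading through a single monolithic pass:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Region:
    box: tuple   # (x0, y0, x1, y1) in pixels
    kind: str    # e.g. "text", "table", "formula"

def parse_document(image,
                   layout_model: Callable[[object], List[Region]],
                   recognizer: Callable[[object, Region], str]) -> List[dict]:
    """Stage 1: a layout model finds regions. Stage 2: a compact
    recognition model reads each cropped region independently."""
    results = []
    for region in layout_model(image):
        results.append({"kind": region.kind,
                        "box": region.box,
                        "text": recognizer(image, region)})
    return results

# Toy stand-ins for the two models, just to exercise the pipeline shape.
fake_layout = lambda img: [Region((0, 0, 100, 20), "text"),
                           Region((0, 30, 100, 90), "table")]
fake_reader = lambda img, region: f"<{region.kind} content>"

parsed = parse_document(None, fake_layout, fake_reader)
print(parsed)
```

Decomposing the task this way is also why a 0.9B recognizer suffices: each call sees a small crop rather than a full page.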
Zhipu's Luck Fell Just Short: Its Visual Token Research Has Collided with DeepSeek Again
量子位· 2025-10-22 15:27
Core Viewpoint
- The article discusses the competition between Zhipu and DeepSeek in the AI field, focusing on Zhipu's visual token solution, Glyph, which aims to address the challenges of long context in large language models (LLMs) [1][2][6].

Group 1: Context Expansion Challenges
- Demand for long context in LLMs is growing, driven by applications such as document analysis and multi-turn dialogue [8].
- Expanding context length sharply increases computational cost: because self-attention scales quadratically with sequence length, doubling the context from 50K to 100K tokens roughly quadruples the compute [9][10].
- Merely adding more tokens does not guarantee better model performance, as excessive input can introduce noise interference and information overload [12][14].

Group 2: Existing Solutions
- Three mainstream approaches to the long-context problem are identified:
  1. Extended Position Encoding: extends the existing position-encoding range to accommodate longer inputs without retraining the model [15][16].
  2. Attention Mechanism Modification: techniques such as sparse and linear attention improve token-processing efficiency, but do not reduce the total token count [20][21].
  3. Retrieval-Augmented Generation (RAG): uses external retrieval to shorten inputs, but can slow overall response time [22][23].

Group 3: Glyph Framework
- Glyph proposes a new paradigm: converting long texts into images, which carry higher information density and can be processed efficiently by vision-language models (VLMs) [25][26].
- Using visual tokens significantly reduces token counts; for example, Glyph can represent the entire text of "Jane Eyre" with only 80K visual tokens instead of 240K text tokens [32][36].
- Glyph is trained in three stages: continual pre-training, LLM-driven rendering search, and post-training, which together enhance the model's ability to interpret visual information [37][44].

Group 4: Performance and Results
- Glyph achieves a token compression rate of 3-4x while maintaining accuracy comparable to mainstream models [49].
- Glyph delivers roughly 4x faster prefill and decoding, and about 2x faster supervised fine-tuning (SFT) [51].
- Glyph also performs strongly on multimodal tasks, indicating robust generalization [53].

Group 5: Contributors and Future Implications
- The paper's first author is Jiale Cheng, a PhD student at Tsinghua University, with contributions from Yusen Liu, Xinyu Zhang, and Yulin Fei [57][62].
- The article suggests that visual tokens may redefine how LLMs process information, potentially making pixels, rather than text, the fundamental unit of AI input [76][78].
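The cost arithmetic behind these numbers is easy to check. A rough sketch using the figures cited above and the standard quadratic scaling of self-attention; the `2 * n**2 * d` FLOP count is a textbook approximation assumed here, not a formula from the article:

```python
def attention_flops(n_tokens: int, d_model: int = 4096) -> int:
    # Self-attention cost grows quadratically with sequence length:
    # roughly 2 * n^2 * d multiply-adds for QK^T plus the
    # attention-weighted sum over values.
    return 2 * n_tokens**2 * d_model

# Doubling the context (50K -> 100K) quadruples the attention compute.
print(attention_flops(100_000) / attention_flops(50_000))  # -> 4.0

# "Jane Eyre": 240K text tokens vs 80K visual tokens under Glyph.
text_tokens, visual_tokens = 240_000, 80_000
print(text_tokens / visual_tokens)  # -> 3.0 (3x compression)
```

Note that the quadratic term alone would predict a 9x saving at 3x compression; the roughly 4x end-to-end prefill/decoding speedup reported above is smaller because real inference also has per-token linear costs (MLP layers, KV-cache reads).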
Thailand's Department of Foreign Trade Supports Enterprises in Using the DFT SMART C/O System to Promote Thai Exports
Shang Wu Bu Wang Zhan· 2025-09-18 07:49
Core Insights
- The Thai Ministry of Commerce is enhancing the DFT SMART-I system by integrating artificial intelligence and OCR technology to fully digitize export and import licensing and certification services, aiming to facilitate businesses, reduce costs, and improve the competitiveness of Thai products in international markets [1].

Group 1: System Features
- The DFT SMART C/O system allows businesses to apply and track progress online using only their ID cards [1].
- Approved documents can be self-printed, and electronic payment options are available, eliminating the need for in-person collection and significantly saving time and costs [1].

Group 2: Implementation Timeline and Scope
- From December 15, 2023, to August 2025, the system issued 12 types of certificates of origin, covering specific goods under RCEP, ASEAN agreements, and trade with Japan, Australia, Peru, and the European Union [1].