LiteParse: Local Document Parsing for AI Agents
LlamaIndex· 2026-03-19 12:10
Hi everyone, Felia from LlamaIndex here, and today I am really excited to introduce you to a new tool that we are releasing: LiteParse, an open-source parsing tool with spatial text extraction that doesn't have any LLM or cloud dependency. LiteParse has really fast, fully local text extraction with a flexible OCR system that ships with a sensible, tested default and can also be extended through HTTP servers. It can generate screenshots, and it produces multiple output formats such as JSON and ...
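A parser of this kind returns text anchored to page coordinates. As a rough illustration of what spatial JSON output might look like (the `TextBlock` shape below is an assumption for illustration, not LiteParse's documented schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TextBlock:
    """One extracted text span with its page position (hypothetical shape)."""
    text: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page points

def blocks_to_json(blocks):
    """Serialize spatially anchored blocks, grouped by page, to JSON."""
    pages = {}
    for b in blocks:
        pages.setdefault(b.page, []).append(asdict(b))
    return json.dumps({"pages": pages}, indent=2)

blocks = [TextBlock("Invoice #42", page=1, bbox=(72, 96, 210, 112))]
print(blocks_to_json(blocks))
```

Keeping the bounding box next to each span is what lets downstream agents reconstruct layout (columns, tables, reading order) without re-rendering the page.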
Zhipu open-sources an OCR model! After testing it, I uninstalled every scanner app on my phone......
量子位· 2026-02-11 12:49
Core Insights
- The article discusses the capabilities and performance of the GLM-OCR model, highlighting its competitive edge in the OCR technology landscape, particularly in complex scenarios like handwriting and table recognition [1][39]

Performance Comparison
- GLM-OCR outperforms several competitors in various OCR tasks, achieving a document parsing accuracy of 94.6% on OmniDocBench V1.5, surpassing PaddleOCR and others [2]
- In text recognition, GLM-OCR achieves 94.0% accuracy, significantly higher than some competitors like DeepSeek-OCR 2, which only reaches 34.7% [2]
- For formula recognition, GLM-OCR scores 96.5%, indicating strong performance in recognizing mathematical expressions [2]
- The model also excels in table recognition, with an accuracy of 85.2% on PubTabNet, outperforming many alternatives [2]

Practical Applications
- GLM-OCR is particularly effective for structured documents such as Word, PPT, and academic papers, as well as for recognizing clear handwriting, receipts, and scanned contracts [3][4]
- The model demonstrates strong capabilities in recognizing handwritten forms, achieving an accuracy of 86.1% [4]
- It can accurately extract information from various documents, including meeting minutes and whiteboard notes, making it suitable for everyday work scenarios [3][4]

User Experience
- Users report a generally positive experience with GLM-OCR in standard document parsing tasks, although challenges remain with unclear handwriting and complex layouts [4][12]
- The model's ability to handle low-quality inputs is commendable, with a recognition accuracy of around 96% for mixed content, although some errors were noted in specific cases [13][29]

Structural Extraction
- GLM-OCR is capable of structured information extraction, producing outputs in standard JSON format from various documents, which is beneficial for applications like invoicing and identification [36][38]
- The model's performance in structured extraction improves significantly when clear prompts are provided, indicating its adaptability to user requirements [38]

Industry Trends
- The OCR technology market is rapidly evolving, with new models like GLM-OCR emerging to meet increasing demands for efficiency and accuracy [39][40]
- The trend towards smaller model parameters (0.07B to 0.9B) is making deployment easier and more cost-effective for users [51]
- Enhanced output quality and reduced processing times are becoming standard expectations in the OCR industry, benefiting users across various sectors [51]
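The prompt-sensitivity point above can be made concrete: the usual pattern for structured extraction is a schema-constrained prompt plus strict validation of the model's JSON reply. A minimal sketch (the invoice schema and prompt wording are illustrative assumptions, not GLM-OCR's documented interface):

```python
import json

# Illustrative target schema; an assumption for this sketch, not a GLM-OCR API
INVOICE_SCHEMA = {"vendor": "string", "date": "YYYY-MM-DD", "total": "number"}

def build_extraction_prompt(schema: dict) -> str:
    """Ask the model for JSON only; explicit schemas sharpen extraction."""
    return ("Extract these fields from the document image and reply with "
            "JSON only, no prose:\n" + json.dumps(schema, indent=2))

def parse_model_reply(reply: str) -> dict:
    """Strip an optional ```json fence, parse, and check required keys."""
    body = reply.strip()
    if body.startswith("```"):
        body = body.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(body)
    missing = [k for k in INVOICE_SCHEMA if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data
```

Validating the reply instead of trusting it is what makes this usable for invoicing pipelines: a malformed or incomplete extraction fails loudly rather than propagating silently.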
昂立教育: An integrated data flow system for business, finance, funds, and taxation is now in place
Zheng Quan Ri Bao Wang· 2026-01-28 12:14
Core Viewpoint
- The company completed the construction of a financial shared service center in early 2020 and is continuously advancing its financial digital transformation, establishing an integrated data flow system for finance, business, and taxation [1]

Group 1: Financial Transformation
- The company has implemented technologies such as OCR, RPA, and AI to enhance operational efficiency and productivity [1]
- The financial digital transformation is aimed at improving human efficiency and overall effectiveness [1]
DeepSeek open-sources a brand-new OCR model! CLIP dropped in favor of a lightweight Qwen model, with performance rivaling Gemini-3 Pro
量子位· 2026-01-27 08:32
Core Insights
- DeepSeek has released a new OCR model, DeepSeek-OCR 2, which focuses on accurately converting PDF documents to Markdown format [1]
- The model's key breakthrough is the dynamic rearrangement of visual tokens based on image semantics, moving away from traditional raster scanning logic [2][3]
- DeepSeek-OCR 2 achieves performance comparable to Gemini-3 Pro while utilizing a lightweight model [4]

Model Architecture
- DeepSeek-OCR 2 retains the classic architecture of its predecessor, consisting of an encoder and decoder working in tandem [10]
- The encoder, now called DeepEncoder V2, replaces the previous CLIP component with a lightweight language model (Qwen2-0.5B), introducing causal reasoning capabilities [2][13]
- This upgrade allows for intelligent rearrangement of visual tokens before they enter the main decoder, simulating human reading logic [3][15]

Performance Metrics
- On the OmniDocBench v1.5 benchmark, DeepSeek-OCR 2 achieved a performance score of 91.09%, representing a 3.73% improvement over the baseline [5][35]
- The model's document parsing edit distance improved from 0.085 to 0.057, demonstrating the effectiveness of the visual information rearrangement [36]
- Within a similar token budget (1120), DeepSeek-OCR 2 outperformed Gemini-3 Pro in document parsing edit distance [37]

Training and Evaluation
- The training process for DeepSeek-OCR 2 follows a three-stage pipeline, focusing on semantic rearrangement and autoregressive inference [31]
- The model was evaluated on a dataset comprising 1355 pages across various document types, ensuring a comprehensive assessment of its capabilities [33][34]
- The model's design allows for a stable input token count between 256 and 1120, aligning with the visual budget of Gemini-1.5 Pro [27]

Conclusion
- DeepSeek-OCR 2 demonstrates significant advancements in OCR technology, validating the use of language model architecture as a visual encoder and paving the way for unified omni-modal encoders [39]
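The edit-distance figures above are normalized Levenshtein distances between predicted and reference text, where lower is better. A minimal sketch of the common definition (OmniDocBench's exact scoring may differ in detail):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer string's length, in [0, 1]."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))
```

On this scale, the reported improvement from 0.085 to 0.057 means roughly 3 fewer character-level errors per 100 characters of reference text.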
日照港 (Rizhao Port): The company is studying and advancing the construction of a financial shared service center
Group 1
- The company is researching the establishment of a financial shared service center to enhance operational efficiency and data governance [1]
- The initiative includes the introduction of technologies such as OCR (Optical Character Recognition) and RPA (Robotic Process Automation) to facilitate standardized data collection and automated workflows [1]
- The goal is to improve business collaboration efficiency and the overall level of data management [1]
X @子布
子布· 2025-11-27 12:51
How to make 200x with bloom's scraper. Link: https://t.co/Vh14rqhScu

Last night $chog launched on the Monad chain. The official account @ChogNFT tweeted at 23:46:17, with the CA in the tweet's image. Two seconds later, at 23:46:19, a total of 8 addresses bought in successfully, three of them with profits of 300,000-400,000 U. Analysis shows all 8 addresses traded through the contract 0x7d1e3d2858829fd28Bd1F6eBc6F2b0945390c12B, which is bloom's Monad chain address. bloom's scraper now supports OCR, and these 8 addresses sniped first precisely because they used the scraper's OCR feature.

So how do you configure it?
1. Open bloom, link: https://t.co/Vh14rqhScu
2. Select [/twitter] from the menu bar
3. Click [New Configuration]
4. Click [Twitter OCR] so it turns green; this is the option that recognizes the CA in images
Adjust the other settings as needed ...
New DeepSeek just did something crazy...
Matthew Berman· 2025-10-22 17:15
DeepSeek OCR Key Features
- DeepSeek OCR is a novel approach to image recognition that compresses text by 10x while maintaining 97% accuracy [2]
- The model uses a vision language model (VLM) to compress text into an image, allowing for 10 times more text in the same token budget [6][11]
- The method achieves 96%+ OCR decoding precision at 9-10x text compression, 90% at 10-12x compression, and 60% at 20x compression [13]

Technical Details
- The model splits the input image into 16x16 patches [9]
- It uses SAM, an 80 million parameter model, to look for local details [10]
- It uses CLIP, a 300 million parameter model, to store information about how to put the images together [10]
- The output is decoded by DeepSeek 3B, a 3 billion parameter mixture-of-experts model with 570 million active parameters [10]

Training Data
- The model was trained on 30 million pages of diverse PDF data covering approximately 100 languages from the internet [21]
- Chinese and English account for approximately 25 million pages; other languages account for 5 million pages [21]

Potential Impact
- This technology could potentially 10x the context window of large language models [20]
- Andrej Karpathy suggests that pixels might be better inputs to LLMs than text tokens [17]
- An entire encyclopedia could be compressed into a single high-resolution image [20]
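The compression figures above reduce to simple token arithmetic: the ratio is text tokens over vision tokens, and the talk's reported operating points map ratios to expected decoding precision. A small sketch using only the numbers quoted above:

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens

# Operating points quoted above: (upper bound on ratio, approx. decoding precision)
REPORTED = [(10, 0.96), (12, 0.90), (20, 0.60)]

def approx_precision(ratio: float) -> float:
    """Look up the reported precision band for a given compression ratio."""
    for max_ratio, precision in REPORTED:
        if ratio <= max_ratio:
            return precision
    return 0.60  # beyond 20x, the reported precision is ~60% or lower
```

So packing a 1000-token passage into 100 vision tokens sits at the 10x point, where decoding precision is still around 96%; pushing to 20x halves the token cost again but drops precision to roughly 60%.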
10x compression, 97% decoding accuracy! Why DeepSeek's newly open-sourced model has drawn attention at home and abroad
Xin Lang Cai Jing· 2025-10-21 23:26
Core Insights
- DeepSeek has open-sourced a new model called DeepSeek-OCR, which utilizes visual patterns for context compression, aiming to reduce the computational costs associated with large models [1][3][6]

Model Architecture
- DeepSeek-OCR consists of two main components: DeepEncoder, a visual encoder designed for high compression and high-resolution document processing, and DeepSeek3B-MoE, a lightweight language decoder [3][4]
- The DeepEncoder integrates two established visual model architectures: SAM (Segment Anything Model) for local detail processing and CLIP (Contrastive Language–Image Pre-training) for capturing global knowledge [4][6]

Performance and Capabilities
- The model demonstrates strong "deep parsing" abilities, capable of recognizing complex visual elements such as charts and chemical formulas, thus expanding its application in fields like finance, research, and education [6][7]
- Experimental results indicate that when the number of text tokens is within ten times that of visual tokens (compression ratio <10×), the model achieves 97% OCR accuracy, maintaining around 60% accuracy even at a 20× compression ratio [6][7][8]

Industry Reception
- The model has received widespread acclaim from tech media and industry experts, with notable figures like Andrej Karpathy praising its innovative approach of using pixels as input for large language models [3][4]
- Elon Musk commented on the long-term potential of AI models primarily utilizing photon-based inputs, indicating a shift in how data may be processed in the future [4]

Practical Applications
- DeepSeek-OCR is positioned as a highly practical model capable of generating large-scale pre-training data, with a single A100-40G GPU able to produce over 200,000 pages of training data daily [7][8]
- The model's unique approach allows it to compress a 1000-word article into just 100 visual tokens, showcasing its efficiency in processing and recognizing text [8]
The world's strongest OCR model is only 0.9B! A model derived from Baidu's ERNIE (文心) just swept four SOTA results
量子位· 2025-10-17 09:45
Core Insights
- The article highlights the launch of Baidu's new self-developed multi-modal document parsing model, PaddleOCR-VL, which achieved a score of 92.6 on the OmniDocBench V1.5 leaderboard, ranking first globally in comprehensive performance [1][11]

Model Performance
- PaddleOCR-VL, with a parameter count of 0.9 billion, excels in four core capabilities: text recognition, formula recognition, table understanding, and reading order, achieving state-of-the-art (SOTA) results in all dimensions [3][12][13]
- The model supports 109 languages and maintains high recognition accuracy even in complex formats, breaking traditional OCR limitations [14][16]

Technical Specifications
- The model is designed for complex document structure parsing, capable of understanding logical structures, table relationships, and mathematical expressions in documents [5][6]
- PaddleOCR-VL utilizes a two-stage architecture for document layout analysis and fine-grained recognition, enhancing stability and efficiency in handling complex layouts [36][37]

Industry Impact
- PaddleOCR-VL is positioned as a critical tool in various industries, including finance, education, and public services, facilitating digital transformation and process automation [51][52]
- The model's capabilities allow it to serve as a "document work assistant," integrating seamlessly into workflows to improve efficiency and reduce costs [52][56]

Competitive Landscape
- The model's performance challenges the notion that only large models can achieve high effectiveness, demonstrating that a well-structured, focused model can outperform larger counterparts in practical applications [48][49]
- PaddleOCR-VL represents a significant advancement in Baidu's multi-modal intelligence strategy, marking a milestone in the global document parsing landscape [57][58]
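The two-stage design described above separates layout analysis from fine-grained recognition. A minimal sketch of that control flow, where `layout_model` and `recognizer` are stand-in callables, not PaddleOCR-VL's actual API:

```python
def parse_document(page_image, layout_model, recognizer):
    """Stage 1: detect regions and reading order; stage 2: recognize each region."""
    regions = layout_model(page_image)        # e.g. [{"type", "order", ...}, ...]
    regions.sort(key=lambda r: r["order"])    # restore reading order
    return [{"type": r["type"], "content": recognizer(r)} for r in regions]

# Toy stand-ins to show the flow; real models would return crops and text
def fake_layout(img):
    return [
        {"type": "text", "order": 1, "raw": "body"},
        {"type": "title", "order": 0, "raw": "heading"},
    ]

def fake_recognizer(region):
    return region["raw"].upper()

print(parse_document(None, fake_layout, fake_recognizer))
# → [{'type': 'title', 'content': 'HEADING'}, {'type': 'text', 'content': 'BODY'}]
```

Splitting the work this way is what the article credits for stability on complex layouts: the recognizer only ever sees one region at a time, while ordering mistakes are confined to the lightweight layout stage.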
IPO Preview | With CamScanner (扫描全能王) steadily contributing growth, will AI-boosted Intsig Information (合合信息, 688615.SH) "try again" with a Hong Kong listing?
智通财经网· 2025-07-10 10:51
Core Viewpoint
- Intsig Information is expanding its market presence by applying for a listing on the Hong Kong Stock Exchange after successfully listing on the Shanghai Stock Exchange's Sci-Tech Innovation Board in September 2022. The company's flagship product, CamScanner, has contributed significantly to its revenue growth and market position in the AI sector [1][2][11]

Financial Performance
- Intsig's total revenue for 2022 to 2024 was RMB 988.46 million, RMB 1.186 billion, and RMB 1.438 billion respectively. In Q1 2023, revenue reached RMB 395 million, marking a 20% increase from RMB 327 million in the same period last year [2][5]
- The revenue contribution from CamScanner is expected to rise from 72.3% in 2022 to 81.1% in Q1 2024, indicating its dominance in the company's product lineup [1][3]

Product Breakdown
- Revenue from C-end products, including CamScanner, CamCard, and Qixinbao, accounted for 82.2%, 84.3%, and 83.8% of total revenue from 2022 to 2024. In Q1 2023, this segment's contribution increased to 86.6% [3][5]
- CamScanner's revenue is projected to grow from RMB 714.57 million in 2022 to RMB 1.112 billion in 2024, while CamCard and Qixinbao have shown mixed performance [3][5]

Market Position and Competitive Advantage
- Intsig ranks first in China and fifth globally in revenue from C-end efficiency AI products, with over 100 million monthly active users [1][2]
- The company has leveraged advancements in AI, particularly in optical character recognition (OCR) technology, achieving a character recognition rate of 81.9% in complex scenarios, significantly higher than competitors [7][8]

Future Growth Potential
- The company plans to enhance its product offerings by developing new features such as RPA (Robotic Process Automation) and expanding its global market presence, particularly in emerging markets like Brazil, Indonesia, and Mexico, where the conversion rate is currently below one-fourth of that in China [10][11]
- Intsig aims to establish a global technical support center and expand its marketing network to improve brand influence and customer service [10]