DeepSeek open-sources a new OCR model! CLIP is dropped in favor of a lightweight Qwen model, with performance rivaling Gemini-3 Pro
量子位 (QbitAI) · 2026-01-27 08:32
Core Insights
- DeepSeek has released a new OCR model, DeepSeek-OCR 2, which focuses on accurately converting PDF documents to Markdown format [1]
- The model's key breakthrough is the dynamic rearrangement of visual tokens based on image semantics, moving away from traditional raster-scan ordering [2][3]
- DeepSeek-OCR 2 achieves performance comparable to Gemini-3 Pro while using a lightweight model [4]

Model Architecture
- DeepSeek-OCR 2 retains its predecessor's classic architecture: an encoder and a decoder working in tandem [10]
- The encoder, now called DeepEncoder V2, replaces the previous CLIP component with a lightweight language model (Qwen2-0.5B), introducing causal-reasoning capability [2][13]
- This upgrade allows visual tokens to be intelligently reordered before they enter the main decoder, simulating human reading order [3][15]

Performance Metrics
- On the OmniDocBench v1.5 benchmark, DeepSeek-OCR 2 scored 91.09%, a 3.73% improvement over the baseline [5][35]
- The model's document-parsing edit distance improved from 0.085 to 0.057, demonstrating the effectiveness of the visual-information rearrangement [36]
- Under a similar token budget (1120), DeepSeek-OCR 2 outperformed Gemini-3 Pro in document-parsing edit distance [37]

Training and Evaluation
- Training follows a three-stage pipeline focused on semantic rearrangement and autoregressive inference [31]
- The model was evaluated on a dataset of 1355 pages spanning various document types, ensuring a comprehensive assessment of its capabilities [33][34]
- The design keeps the input token count stable between 256 and 1120, aligning with the visual budget of Gemini-1.5 Pro [27]

Conclusion
- DeepSeek-OCR 2 demonstrates significant advances in OCR technology, validating the use of a language-model architecture as a visual encoder and paving the way for unified omni-modal encoders [39]
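The edit-distance numbers above (0.085 → 0.057) are typically computed as a normalized Levenshtein distance between the model's Markdown output and the reference transcription. The sketch below shows that metric in plain Python; it is illustrative only, and the exact normalization used in DeepSeek's OmniDocBench evaluation may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Wagner-Fischer dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer string's length (lower is better)."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))

# Example: a near-perfect Markdown transcription differing by one character
ref = "# Title\n\nHello world."
pred = "# Title\n\nHello world!"
print(round(normalized_edit_distance(pred, ref), 3))  # prints 0.048
```

A score of 0.057 on this scale means that, averaged over the benchmark, roughly 5.7% of the reference characters would need to be edited to match the model's output.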