DeepSeek-OCR: Large-Model Technology Stands at a New Crossroads
36Kr· 2025-10-22 23:15
Core Insights
- DeepSeek has introduced "DeepSeek-OCR," a model that utilizes "Context Optical Compression," significantly enhancing the efficiency of processing textual information from images [1][2][7]
- The model demonstrates that images can serve as efficient carriers of information, challenging the traditional reliance on text-based processing [2][6]

Group 1: Image Processing Efficiency
- DeepSeek-OCR processes documents by treating text as images, compressing entire pages into a few visual tokens, achieving a tenfold efficiency increase with a 97% accuracy rate [1][2]
- Traditional methods require thousands of tokens for a lengthy article, while DeepSeek-OCR needs only about 100 visual tokens, allowing it to handle long documents without resource constraints [2][3]

Group 2: System Architecture and Functionality
- The system consists of two modules: a powerful DeepEncoder that captures page information and a lightweight text generator that converts visual tokens into readable output [3]
- The encoder combines local analysis and global understanding, reducing the initial 4096 tokens to just 256, a roughly 90% reduction compared to competitors [3][4]
- In practical tests, a single A100 GPU can process over 200,000 pages daily, with potential scalability to 33 million pages across multiple servers [3][4]

Group 3: Information Density and Model Training
- The paradox of image data being more efficient lies in its information density: images can encapsulate more data compactly than text tokens, which require extensive dimensional expansion [4][5]
- While DeepSeek-OCR proves the feasibility of visual tokens, training purely visual models remains a challenge due to the ambiguity of predicting image segments [5][9]

Group 4: Potential Impact and Applications
- If widely adopted, this technology could transform the "token economy," significantly reducing processing costs for long documents and enhancing data extraction from complex formats [6][7]
- It could also improve chatbots' long-term memory by converting old conversations into low-resolution images, simulating human memory decay while extending context without increasing token consumption [6][11]

Group 5: Conclusion
- The exploration of DeepSeek-OCR not only achieves a tenfold efficiency improvement but also redefines the boundaries of document processing, challenging existing limitations and optimizing cost structures [7][8]
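The compression and throughput figures quoted above can be checked with back-of-the-envelope arithmetic. All numbers come from the article; the function name and the implied server count are illustrative only:

```python
# Back-of-the-envelope arithmetic for the figures quoted in the summary above.

def compression_ratio(text_tokens: int, visual_tokens: int) -> float:
    """How many text tokens each visual token stands in for."""
    return text_tokens / visual_tokens

# A roughly 1000-token article represented by about 100 visual tokens:
print(compression_ratio(1000, 100))  # 10.0, the "tenfold" gain

# Throughput scaling: at 200,000 pages/day per A100, the quoted 33 million
# pages/day implies on the order of 165 GPUs (hypothetical count).
print(33_000_000 // 200_000)  # 165
```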
The AI race is heating up again! DeepSeek open-sources a new model, OpenAI launches an AI browser! The STAR Market AI ETF pulls back with the market: time to buy the dip?
Xin Lang Ji Jin· 2025-10-22 03:32
On October 20, the Chinese AI company DeepSeek announced the open-sourcing of its latest model, DeepSeek-OCR, a vision-text compression paradigm that uses a small number of visual tokens to represent content that would otherwise require a large number of text tokens, thereby reducing the computational cost of large models.

Why has DeepSeek's new open-source model drawn attention at home and abroad? Industry observers note that traditional AI models "read" text directly, whereas DeepSeek-OCR first "looks" at an image of the text and then compresses the image information of a whole document page into very few visual tokens. DeepSeek-OCR's strength is that it can compress a 1,000-character article into 100 visual tokens; at tenfold compression, recognition accuracy reaches 96.5%.

Overseas, on Tuesday local time (October 21), OpenAI launched the AI browser Atlas to compete head-on with Google Chrome; users can invoke ChatGPT directly on any webpage to summarize content, ask questions, or carry out tasks. OpenAI CEO Sam Altman said that AI represents a once-in-a-decade opportunity to rethink what a browser should be.

Dongxing Securities notes that the AI industry is currently in a phase of three-way resonance among policy, technology, and demand; combined with the top-down policy support and potential funding brought by the "AI+" initiative, domestic chip and cloud-computing leaders are gradually validating their earnings, and major platforms' Cap ...
The AI race is heating up again! DeepSeek open-sources a new model, OpenAI launches an AI browser! The STAR Market AI ETF pulls back with the market: the moment to buy the dip has arrived
Xin Lang Ji Jin· 2025-10-22 03:32
Overseas, on Tuesday local time (October 21), OpenAI launched the AI browser Atlas to compete head-on with Google Chrome; users can invoke ChatGPT directly on any webpage to summarize content, ask questions, or carry out tasks. OpenAI CEO Sam Altman said that AI represents a once-in-a-decade opportunity to rethink what a browser should be.

On the policy front, the Ministry of Industry and Information Technology is soliciting public comments on the Computing Power Standards System Construction Guide (2025 Edition), which proposes that by 2027 more than 50 standards be drafted or revised covering general fundamentals, computing facilities, computing equipment, compute-network convergence, computing interconnection, computing platforms, computing applications, computing security, and green low-carbon operation, effectively advancing the construction of the computing power standards system.

On October 20, the Chinese AI company DeepSeek announced the open-sourcing of its latest model, DeepSeek-OCR, a vision-text compression paradigm that uses a small number of visual tokens to represent content that would otherwise require a large number of text tokens, thereby reducing the computational cost of large models.

Why has DeepSeek's new open-source model drawn attention at home and abroad? Industry observers note that traditional AI models "read" text directly, whereas DeepSeek-OCR first "looks" at an image of the text and then compresses the image information of a whole document page into very few visual tokens. DeepSeek-OCR's strength is that it can compress a 1,000-character article into 1 ...
The New Model DeepSeek Open-Sourced Yesterday Is a Bit Uncanny
36Kr· 2025-10-22 01:00
Core Insights
- DeepSeek has introduced a new model called DeepSeek-OCR, which can compress text information into images, achieving a significant reduction in token usage while maintaining high accuracy [5][31][39]

Group 1: Model Capabilities
- DeepSeek-OCR can store large amounts of text as images, allowing a more efficient representation of information than traditional text-based models [9][10]
- The model demonstrates a strong compression ratio: it can use only 100 visual tokens to outperform previous models that required 256 tokens, and it achieves results with fewer than 800 visual tokens compared with over 6000 used by other models [14][31]
- DeepSeek-OCR supports various resolutions and compression modes, adapting to different document complexities, with modes ranging from Tiny to Gundam that allow dynamic adjustment based on content [17][18]

Group 2: Data Utilization
- The model can capture previously unutilized data from documents, such as graphs and images, which traditional models could not interpret effectively [24][26]
- DeepSeek-OCR can generate over 200,000 pages of training data in a day on an A100 GPU, indicating its potential to enhance training datasets for future models [29]
- By utilizing image memory, the model significantly reduces computational load, allowing more efficient processing of longer conversations without a proportional increase in resource consumption [31]

Group 3: Open Source Collaboration
- The development of DeepSeek-OCR is a collaborative effort, integrating various open-source resources, including Huawei's Wukong dataset and Meta's SAM for image feature extraction [38][39]
- The model's architecture reflects a collective achievement of the open-source community, showcasing the potential of collaborative innovation in AI development [39]
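The Tiny-to-Gundam modes mentioned above can be sketched as a lookup plus a selection heuristic. The mode table below is an assumption pieced together from figures reported for DeepSeek-OCR, and the density heuristic is invented for illustration; Gundam mode, which tiles a page dynamically, is omitted:

```python
# Sketch of the resolution/compression modes (illustrative, not an official spec).

MODES = {                 # mode -> (input side in px, visual tokens)
    "Tiny":  (512, 64),
    "Small": (640, 100),
    "Base":  (1024, 256),
    "Large": (1280, 400),
}

def pick_mode(text_density: float) -> str:
    """Toy heuristic: denser, more complex pages get a larger mode."""
    if text_density < 0.2:
        return "Tiny"
    if text_density < 0.4:
        return "Small"
    if text_density < 0.7:
        return "Base"
    return "Large"

side, tokens = MODES[pick_mode(0.5)]
print(side, tokens)  # 1024 256
```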
10x Compression, 97% Decoding Accuracy! Why DeepSeek's New Open-Source Model Has Won Attention at Home and Abroad
Xin Lang Cai Jing· 2025-10-21 23:26
Core Insights
- DeepSeek has open-sourced a new model called DeepSeek-OCR, which utilizes visual patterns for context compression, aiming to reduce the computational costs associated with large models [1][3][6]

Model Architecture
- DeepSeek-OCR consists of two main components: DeepEncoder, a visual encoder designed for high compression and high-resolution document processing, and DeepSeek3B-MoE, a lightweight language decoder [3][4]
- The DeepEncoder integrates two established visual model architectures: SAM (Segment Anything Model) for local detail processing and CLIP (Contrastive Language-Image Pre-training) for capturing global knowledge [4][6]

Performance and Capabilities
- The model demonstrates strong "deep parsing" abilities, recognizing complex visual elements such as charts and chemical formulas, which expands its applications in fields like finance, research, and education [6][7]
- Experimental results indicate that when the number of text tokens is within ten times that of visual tokens (compression ratio <10x), the model achieves 97% OCR accuracy, and it maintains around 60% accuracy even at a 20x compression ratio [6][7][8]

Industry Reception
- The model has received widespread acclaim from tech media and industry experts, with notable figures like Andrej Karpathy praising its innovative approach of using pixels as input for large language models [3][4]
- Elon Musk commented on the long-term potential of AI models primarily using photon-based inputs, indicating a shift in how data may be processed in the future [4]

Practical Applications
- DeepSeek-OCR is positioned as a highly practical model capable of generating large-scale pre-training data, with a single A100-40G GPU able to produce over 200,000 pages of training data daily [7][8]
- The model's approach allows it to compress a 1000-word article into just 100 visual tokens, showcasing its efficiency in processing and recognizing text [8]
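The DeepEncoder's token flow can be sketched shape-only: per-patch tokens from the SAM-style local stage are compressed about 16x before the CLIP-style global stage, matching the 4096-to-256 reduction cited elsewhere in this digest. The patch size and compression factor are assumptions, not confirmed specs:

```python
# Shape-only sketch of the DeepEncoder token flow (assumed 16px patches,
# assumed 16x token compression between the local and global stages).

def visual_token_count(side: int, patch: int = 16, compress: int = 16) -> int:
    """Patch tokens for a square page, then compression before global attention."""
    patch_tokens = (side // patch) ** 2
    return patch_tokens // compress

# A 1024x1024 page: 4096 patch tokens in, 256 compact visual tokens out.
print(visual_token_count(1024))  # 256
```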
DeepSeek's Ultimate Ambition: Remaking the Basic Language of Large Language Models into Images
36Kr· 2025-10-21 12:52
Core Insights
- DeepSeek has open-sourced DeepSeek-OCR, an OCR model that achieves state-of-the-art results on benchmarks like OmniDocBench [1]
- The motivation for entering the OCR field is to address the computational bottleneck of long-context processing in large language models (LLMs) [4][6]
- The paper proposes that text information can be efficiently compressed through optical 2D mapping, allowing vision-language models (VLMs) to decompress the original information from images [4][6]

Group 1: Long Context Processing
- The pursuit of longer context in LLMs has led to a competitive arms race, with token windows expanding from thousands to millions [7]
- The core limitation arises from the attention mechanism in the Transformer architecture, where computational complexity and memory usage grow quadratically with sequence length [7]
- DeepSeek-AI's engineers pose a fundamental question: can the number of tokens itself be compressed, rather than just optimizing the attention computation? [7][10]

Group 2: Visual Tokens vs. Text Tokens
- Visual tokens are the basic units of information processed by visual models, while text tokens are used by LLMs [8]
- A 1024x1024 image can be divided into 4096 visual tokens, significantly reducing the number of tokens needed compared with a text representation [9]
- The insight that visual modalities can serve as efficient compression media for text information led to the creation of DeepSeek-OCR [9]

Group 3: DeepEncoder and Compression Techniques
- DeepSeek-OCR is essentially a proof of concept for an "optical compression-decompression" system [10]
- The DeepEncoder, a key innovation, is designed to handle high-resolution inputs while producing minimal visual tokens [11][12]
- The architecture consists of three stages: a local detail processor, a compression module, and a global attention layer [14][16]

Group 4: Performance Metrics
- Experimental results show a 10.5x compression rate, with 64 visual tokens decoding 600-700 text tokens at an OCR accuracy of 96.5% [17][18]
- At a 20x compression rate, the model maintains around 60% accuracy while decoding over 1200 text tokens [17][18]
- DeepSeek-OCR outperforms existing models like GOT-OCR2.0 and MinerU2.0 in both performance and token efficiency [19][20]

Group 5: Future Vision and Memory Simulation
- The team aims to simulate human memory's forgetting mechanism, which naturally prioritizes relevant information while compressing less important details [25][27]
- The multi-resolution design of DeepSeek-OCR provides a technical foundation for managing memory in a way that mimics human cognitive processes [29][30]
- The ultimate goal is a system that balances information retention and computational efficiency, potentially leading to a new paradigm in AI memory and input systems [32][35]
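The forgetting mechanism described above can be sketched as older context being re-rendered at progressively lower resolution, so its visual-token cost decays with age instead of growing linearly. The tiers, thresholds, and token counts below are invented for illustration (the per-resolution counts echo figures quoted elsewhere in this digest):

```python
# Illustrative sketch of resolution-based memory decay (invented tiers).

def tokens_for_age(age_in_turns: int) -> int:
    """Visual tokens allotted to a remembered page, by how old it is."""
    if age_in_turns < 5:
        return 256  # recent: full 1024x1024 rendering
    if age_in_turns < 20:
        return 100  # older: 640x640
    return 64       # oldest: 512x512 thumbnail

# 30 turns of history cost far less than 30 full-resolution pages:
print(sum(tokens_for_age(a) for a in range(30)))  # 3420 vs 30 * 256 = 7680
```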
Which AI Turned $10,000 into Big Profits? DeepSeek First, GPT 5 Last
Di Yi Cai Jing· 2025-10-21 12:33
In recent days, AI communities everywhere have been flooded by a live-streamed "investment showdown." Netizens are tracking the trading performance of six large AI models in real time, with discussion even more enthusiastic than their own stock research; this is an AI investment contest waged with real money.

The "Alpha Arena" benchmark, launched by the startup Nof1, is not paper trading. To measure AI investing ability, the organizers gave each model's account $10,000 in starting capital and let the models trade cryptocurrencies autonomously in live markets. Alpha Arena livestreams the whole process: prices move in real time, real-time returns are ranked, and each model's trading rationale is visible.

Ranked by current profitability, six AI models are competing: DeepSeek chat v3.1, Claude Sonnet 4.5, Grok 4, Qwen3 Max, Gemini 2.5 pro, and GPT 5, comprising three leading overseas models and two domestic ones. The trading contest began on October 18 US Eastern time and will run for two weeks, ending on November 3.

What makes real-market trading interesting is that markets always fluctuate and are unpredictable; even the most advanced AI cannot maintain steady returns. As the organizers put it, "markets are the ultimate test of intelligence."

Four days in, there has already been some volatility. Over the first three days, first-place DeepSeek's return at one point approached 40%, with profits exceeding $4,000, but 1 ...
In Depth | DeepSeek-OCR Ignites the "Language vs. Pixels" Debate, with Karpathy and Musk Backing "Everything Ends Up as Pixels"; the Vision Camp Is on the Eve of a Breakout
Sou Hu Cai Jing· 2025-10-21 12:25
Technical core: 10x compression plus multi-resolution, an engineering path from "reading" to "looking"

DeepSeek-OCR's design philosophy is very clear: achieve extremely high information-compression efficiency through a multi-resolution visual encoding mechanism.

The model offers several resolution options: the lowest, a 512x512 image, needs only 64 tokens, while 1024x1024 corresponds to 256 tokens. For complex layouts, it combines multiple resolutions: the whole page is globally encoded with several 1024x1024 blocks, and key regions are then processed separately at a high 640x640 resolution.

The underlying logic of this approach is to render text into images first, then use a visual encoder to compress them into fewer visual tokens. The traditional pipeline is "split into characters/words -> a long string of text tokens -> feed to the LLM"; DeepSeek's pipeline is "turn a page of text into several multi-scale image tiles -> visual encoding -> a small number of visual tokens." From an engineering trade-off perspective, this yields three direct benefits:

DeepSeek also provides a coarse-to-fine multi-resolution path in engineering terms: cover the whole page at a coarser resolution, then fill in key regions at higher resolution, preserving overall structure while concentrating detail where the information density matters.

Over the past year, the large-model world has looked like a "compute Olympics": whoever has more parameters, higher benchmarks, and faster throughput wins the next round of funding and attention. But DeepSeek-OC ...
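The coarse-to-fine layout described above implies a simple token budget: global coverage with 1024x1024 blocks plus high-detail 640x640 crops for key regions. The per-tile token counts follow the figures quoted above; the example page layout itself is made up for illustration:

```python
# Rough token budget for a coarse-to-fine page layout (illustrative).

GLOBAL_BLOCK_TOKENS = 256  # one 1024x1024 block, per the quoted figures
FOCUS_CROP_TOKENS = 100    # one 640x640 high-detail crop

def page_budget(global_blocks: int, focus_crops: int) -> int:
    """Total visual tokens for a page: coarse coverage plus detail crops."""
    return global_blocks * GLOBAL_BLOCK_TOKENS + focus_crops * FOCUS_CROP_TOKENS

# A page covered by 2 global blocks, with 3 regions re-encoded in detail:
print(page_budget(2, 3))  # 812 visual tokens
```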
Which AI Turned $10,000 into Big Profits? DeepSeek First, GPT 5 Last
Di Yi Cai Jing· 2025-10-21 11:24
The AI investment battle at home and abroad has begun, and the outcome is undecided.

In recent days, AI communities everywhere have been flooded by a live-streamed "investment showdown." Netizens are tracking the trading performance of six large AI models in real time, with discussion even more enthusiastic than their own stock research; this is an AI investment contest waged with real money.

The "Alpha Arena" benchmark, launched by the startup Nof1, is not paper trading. To measure AI investing ability, the organizers gave each model's account $10,000 in starting capital and let the models trade cryptocurrencies autonomously in live markets. Alpha Arena livestreams the whole process: prices move in real time, real-time returns are ranked, and each model's trading rationale is visible.

Ranked by current profitability, six AI models are competing: DeepSeek chat v3.1, Claude Sonnet 4.5, Grok 4, Qwen3 Max, Gemini 2.5 pro, and GPT 5, comprising three leading overseas models and two domestic ones. The trading contest began on October 18 US Eastern time and will run for two weeks, ending on November 3.

What makes real-market trading interesting is that markets always fluctuate and are unpredictable; even the most advanced AI cannot maintain steady returns. As the organizers put it, "markets are the ultimate test of intelligence."

Four days in, there has already been some volatility. Over the first three days, first-place DeepSeek's return at one point ...
DeepSeek-OCR Bursts onto the Scene, Opening a New "Field of Vision" for OCR with 3B Parameters! The ChinaAMC STAR Market AI ETF (589010) Is Active in Early Trading as the AI Theme Stays Hot
Mei Ri Jing Ji Xin Wen· 2025-10-21 07:36
As of 9:47, the STAR Market AI ETF (589010) was trending upward in choppy early trading at 1.389 yuan, up 0.94% from the previous close. The ETF opened at 1.392 yuan, quickly pulled back, found support around 1.38 yuan, and staged a short-term V-shaped rebound. Trading was active: turnover reached 190 million yuan within 20 minutes of the open, showing strong market participation. Among its holdings, 26 stocks rose, more than 80% of the total, led by Willfar Information, Intsig, and Bestechnic; Cambricon and UCloud were the weakest decliners. The ETF has seen net inflows for five consecutive days, reflecting sustained interest in the STAR Market AI theme.

On the news front, the DeepSeek-AI team released the paper "DeepSeek-OCR: Contexts Optical Compression," proposing a new method of compressing long text contexts via the visual modality. The Hugging Face page shows the model has 3B parameters. According to the introduction, the open-sourced DeepSeek-OCR consists of two parts: the core enc ...