量子位
DeepSeek open-sources a brand-new OCR model! CLIP is dropped for a lightweight Qwen model, with performance rivaling Gemini-3 Pro
量子位· 2026-01-27 08:32
By henry, from Aofei Temple | QbitAI (WeChat official account QbitAI)

DeepSeek has just open-sourced a new OCR model, DeepSeek-OCR 2, built for accurately converting PDF documents into Markdown.

Compared with the first-generation model released on October 20 last year, the core breakthrough of DeepSeek-OCR 2 is that it breaks with the rigid "raster scan" logic of traditional models and instead dynamically reorders visual tokens according to image semantics. To do this, DeepSeek-OCR 2 drops the CLIP component used in its predecessor and builds DeepEncoder V2 on a lightweight language model (Qwen2-0.5B), introducing "causal reasoning" capability at the visual-encoding stage. This change mimics the causal visual flow of a human reading a document, letting the model intelligently reorder visual tokens before the LLM interprets the content.

On performance, DeepSeek-OCR 2 matches Gemini-3 Pro while remaining a lightweight model. On the OmniDocBench v1.5 benchmark it improves by 3.73% and makes notable progress in visual reading order. [Table: per-model OmniDocBench results — max visual tokens, Overall ↑, Formula ↑, Table TEDS ↑; truncated in source] ...
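The difference between raster-scan order and semantic reordering can be sketched with a toy example. This is not DeepSeek's code; the "score" field simply stands in for whatever reading-order priority a causal encoder such as DeepEncoder V2 might assign to each region:

```python
# Toy contrast between raster-scan token order and semantic reordering.
# Assumption: a 2x2 grid of page regions laid out as two text columns;
# "score" is a hypothetical reading-order priority from a causal encoder.

def raster_order(patches):
    """Row-major order: how a conventional ViT/CLIP-style encoder emits tokens."""
    return sorted(patches, key=lambda p: (p["row"], p["col"]))

def semantic_order(patches):
    """Reorder by a predicted reading-order score instead of grid position."""
    return sorted(patches, key=lambda p: p["score"])

# A two-column page: the left column should be read top-to-bottom first.
page = [
    {"id": "L-top",    "row": 0, "col": 0, "score": 0},
    {"id": "R-top",    "row": 0, "col": 1, "score": 2},
    {"id": "L-bottom", "row": 1, "col": 0, "score": 1},
    {"id": "R-bottom", "row": 1, "col": 1, "score": 3},
]

print([p["id"] for p in raster_order(page)])    # interleaves the two columns
print([p["id"] for p in semantic_order(page)])  # follows the human reading flow
```

On a multi-column page, raster order mixes text from unrelated columns into one stream, which is exactly the failure mode semantic reordering is meant to fix.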
Robots couldn't see clearly; Ant just fixed it
量子位· 2026-01-27 06:57
By Jin Lei, from Hangzhou | QbitAI (WeChat official account QbitAI)

Robots have long suffered from being unable to see transparent and reflective objects. After all, even small animals, and sometimes people, comically walk straight into spotless glass doors. Worse, when a robot tries to pick up a transparent glass cup or a reflective stainless-steel object, it frequently goes "suddenly blind".

The root of the problem is the robot's eyes: the depth camera. Whether based on structured light or binocular stereo vision, depth cameras rely on stable reflection of light from object surfaces. Transparent materials let light pass straight through, while highly reflective materials bounce it away in unpredictable directions, so the sensor receives no valid return signal and produces large numbers of missing or erroneous depth values. Compare what a human sees with what the robot sees in the same scene and the problem is obvious.

It is no exaggeration to say that this kind of blindness has been a major obstacle keeping robots from safely entering homes, shopping malls, and hospitals. But now, with a newly proposed technique, the robots' eye trouble has finally been cured: 蚂蚁灵波 (RobbyAnt), Ant Group's embodied-intelligence company, has open-sourced LingBot-Depth, which it bills as the world's clearest-seeing depth vision model. For the same two scenes above, let's look directly at the results with LingBot-Depth ...
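The failure mode described above, where pixels with no valid return show up as holes in the depth map, can be made concrete with a small sketch. This is not LingBot-Depth itself; the nearest-valid fill is a deliberately naive stand-in for what a learned model does with semantic completion:

```python
# Toy sketch of depth-map holes from transparent/reflective surfaces.
# Assumption: missing returns are encoded as 0, a common convention.

def find_holes(depth):
    """Return (row, col) of pixels with no valid sensor return (value 0)."""
    return [(r, c) for r, row in enumerate(depth)
            for c, d in enumerate(row) if d == 0]

def fill_row_nearest(depth):
    """Naive repair: fill each hole with the nearest valid value in its row.
    A learned depth model replaces this heuristic with semantic completion."""
    out = [row[:] for row in depth]
    for r, row in enumerate(out):
        valid = [(c, d) for c, d in enumerate(row) if d != 0]
        for c, d in enumerate(row):
            if d == 0 and valid:
                out[r][c] = min(valid, key=lambda v: abs(v[0] - c))[1]
    return out

# Depth in mm; a glass cup leaves a hole in the middle of each scanline.
depth = [
    [800, 810, 0, 0, 830],
    [805, 815, 0, 0, 835],
]
print(find_holes(depth))
print(fill_row_nearest(depth))
```

A heuristic like this only smears neighboring values into the gap; recovering the true geometry of a glass cup requires the kind of learned prior the article attributes to LingBot-Depth.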
Altman admits OpenAI's course went astray, and says "writing code will no longer be important"
量子位· 2026-01-27 05:37
Core Viewpoint - AI is redefining work, technology, and education, leading to an increase in demand for software engineers rather than a decrease [4][6][7]. Group 1: AI and Software Engineering - AI will enable engineers to capture more work value, reducing time spent on coding and debugging, allowing them to focus on making systems work effectively [4][6]. - The number of software engineering jobs is expected to significantly increase, with a larger portion of global GDP being created through AI-driven methods [6][7]. - Custom software tailored for individuals or small groups will become prevalent, enhancing personal productivity [5][6]. Group 2: AI Model Development - OpenAI acknowledges past mistakes in developing the ChatGPT-5 series, which focused too much on specific capabilities at the expense of others [18][19]. - The future direction aims to return to a more balanced, general-purpose model that excels across various dimensions, including communication and expression [21][22]. - There is confidence that future models can integrate multiple strong capabilities into a single framework [23][28]. Group 3: Economic Implications of AI - AI is expected to have a deflationary effect, empowering individuals to accomplish tasks previously reserved for larger organizations, potentially reducing long-standing economic disparities [34][36]. - However, there is a cautionary note that AI could also concentrate power and wealth in the hands of a few, depending on how it is deployed and regulated [37][38]. Group 4: AI in Education - AI's role in early childhood education is questioned, with a belief that technology should not be introduced at such a formative stage [14][16]. - The long-term impacts of technology on youth development remain unclear, necessitating careful consideration before integrating AI into educational settings [15][16]. 
Group 5: AI and Attention Economy - Despite advancements in AI making software development easier, the challenge of capturing human attention and creating meaningful connections with products remains significant [43][45]. - The scarcity of human attention in a world of abundant software capabilities means that creating exceptional value is still essential for entrepreneurial success [46].
A 3D Nano Banana is here! AI model retouching becomes reality as 3D generation enters the editable era
量子位· 2026-01-27 03:53
Core Viewpoint - The article highlights the emergence of 3D generation technology as a critical area in AI, with significant advancements led by the Chinese team Hyper3D, particularly through their product Rodin Gen-2 Edit, which integrates 3D generation and editing capabilities [1][3][27]. Group 1: 3D Generation and Editing Technology - Hyper3D has launched Rodin Gen-2 Edit, the first commercial product that combines "3D generation" and "3D editing" into a complete workflow, marking the entry of 3D generation into the editable era [3][11]. - The editing functionality allows users to select specific areas of a model and input text commands for modifications, such as changing a robot's arms to cannons, demonstrating a user-friendly approach to 3D model editing [4][5][20]. - The platform supports importing any existing models, including third-party AI-generated models, for editing, establishing Hyper3D's editing capabilities as a foundational infrastructure rather than a standalone feature [9][11]. Group 2: Technological Advancements and User Experience - Hyper3D Rodin showcases cutting-edge technology, enabling users to modify, add, or remove model components through natural language without affecting the overall structure, thus revolutionizing 3D modeling [13][21]. - The transition from "generation" to "editing" fills a crucial gap in the AI workflow, allowing for iterative design processes rather than random generation, which has been common in the past [14][19]. - The platform's capabilities are enhanced by the introduction of 3D ControlNet, which allows precise control over geometric structures during the generation phase, and the BANG technology, which facilitates recursive disassembly of complex models for localized editing [17][25]. 
Group 3: Market Position and Future Directions - Hyper3D's advancements have been recognized by the market, with the team completing two rounds of funding from top-tier VC and strategic industry players in 2025, indicating strong investor confidence in their technology [27]. - The company aims to extend beyond single-object editing, with future developments targeting the creation of complete 3D scenes that include objects, relationships, and physical constraints, laying the groundwork for future "world models" and embodied intelligence infrastructure [26]. - The launch of Rodin Gen-2 Edit represents a significant step in making 3D generation not just feasible but practically usable, providing a valuable reference point for the industry [27].
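The select-a-region-and-type-an-instruction workflow described above can be sketched as a data model. This is a purely hypothetical interface, not Rodin Gen-2 Edit's actual API; the point is only that an edit is scoped to one part while the rest of the model is preserved:

```python
# Hypothetical sketch of a mask-scoped 3D edit request. All names here
# (EditRequest, Model3D, apply_edit) are illustrative, not Rodin's API.

from dataclasses import dataclass, field

@dataclass
class EditRequest:
    target_part: str   # the user-selected region of the mesh
    instruction: str   # natural-language command, e.g. "turn these into cannons"

@dataclass
class Model3D:
    parts: dict = field(default_factory=dict)  # part name -> description

    def apply_edit(self, req: EditRequest) -> "Model3D":
        """Return a new model in which only the targeted part changes."""
        if req.target_part not in self.parts:
            raise KeyError(f"unknown part: {req.target_part}")
        new_parts = dict(self.parts)
        new_parts[req.target_part] = f"{req.instruction} (edited)"
        return Model3D(new_parts)

robot = Model3D({"head": "dome head", "arms": "gripper arms", "legs": "treads"})
edited = robot.apply_edit(EditRequest("arms", "cannons"))
print(edited.parts["arms"])   # only the selected part changes
print(edited.parts["head"])   # everything else is preserved
```

Returning a new model rather than mutating in place mirrors the iterative design loop the article describes: each edit is a revision the user can keep or discard.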
Stanford and NVIDIA unveil test-time reinforcement learning: a fine-tuned open-source model beats top closed-source models for only a few hundred dollars
量子位· 2026-01-27 02:33
Core Insights - The article discusses a new approach called Test-Time Training to Discover (TTT-Discover), which aims to solve open scientific problems by incorporating reinforcement learning during the testing phase of model evaluation [1][2]. Group 1: Methodology - TTT-Discover is based on the open-source model gpt-oss-120b and achieves state-of-the-art (SOTA) performance across multiple domains, outperforming human experts and closed-source models [3]. - Unlike traditional methods that rely on "Test-time Scaling" through prompt scheduling, TTT-Discover updates model weights during the testing phase to learn from specific problems [4][5]. - This "test-time training" allows the model to gain real-time experience from failed attempts, leading to a directed evolution of its capabilities [6]. Group 2: Learning Objectives - TTT-Discover employs an Entropic Objective, which focuses on maximizing the reward for the best actions rather than average rewards across all tasks, aiming for a single optimal solution instead of multiple mediocre ones [9][10][11]. - The method introduces a reuse mechanism inspired by PUCT, maintaining historical attempts in a buffer to prioritize the most promising states while balancing exploration [12]. Group 3: Implementation and Results - The model generates a "private dataset" through continuous action generation and feedback reception, addressing the out-of-distribution (OOD) problem by creating data specific to the problem at hand [13][14]. - TTT-Discover's approach contrasts with traditional test-time search methods, which do not update model weights and thus do not enhance the model's capabilities [15][16]. - The algorithm involves a cycle of selecting potential solutions, generating new attempts, and evaluating results, with the model's weights updated after each iteration to improve performance [17][18][27]. 
Group 4: Performance Metrics - In experimental settings, TTT-Discover demonstrated a speed improvement of approximately 2 times compared to the best human implementations in kernel engineering tasks [27]. - The testing cost for a single problem is estimated to be several hundred dollars, showcasing the efficiency of the approach [27]. Group 5: Future Directions - TTT-Discover is primarily applicable to continuous reward scenarios, with future work needed to extend its capabilities to sparse, binary, and unverifiable reward problems [29].
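The contrast between a mean-reward objective and an entropic one can be shown numerically. This is not the paper's code, and the exact formulation in TTT-Discover may differ; the sketch uses one common risk-seeking form, (1/β)·log E[exp(β·r)], which approaches max(r) as β grows, matching the article's point that the goal is one best solution rather than many mediocre ones:

```python
import math

# Toy contrast between a mean-reward objective and an entropic objective
# that emphasizes the best attempts. Assumption: the log-mean-exp form
# below is one standard entropic objective, not necessarily the paper's.

def mean_objective(rewards):
    return sum(rewards) / len(rewards)

def entropic_objective(rewards, beta=5.0):
    m = max(rewards)  # subtract the max for numerical stability
    avg = sum(math.exp(beta * (r - m)) for r in rewards) / len(rewards)
    return m + math.log(avg) / beta

attempts_a = [0.2, 0.2, 0.9]  # mostly weak, but one strong attempt
attempts_b = [0.5, 0.5, 0.5]  # uniformly mediocre

print(mean_objective(attempts_a) < mean_objective(attempts_b))          # mean prefers B
print(entropic_objective(attempts_a) > entropic_objective(attempts_b))  # entropic prefers A
```

Under the mean objective, the uniformly mediocre policy looks better; under the entropic objective, the policy that occasionally finds a great solution wins, which is the right incentive for open-ended discovery.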
QbitAI is hiring editors and writers
量子位· 2026-01-27 02:33
By the editorial team, from Aofei Temple | QbitAI (WeChat official account QbitAI)

We are a content platform centered on tracking new advances in AI. After eight years, we have built top-tier influence, broad and widely recognized industry resources, and one of the best vantage points for observing and learning at this moment of the era. The AI wave is still surging; if you don't yet know how to take part in it, why not join 量子位 (QbitAI)?

We are currently hiring in three directions, and we hope you are (or can become) a content expert in one of them:

- AI industry: infrastructure-layer innovation, including chips, AI Infra, and cloud computing;
- AI finance: venture capital and earnings reports in AI, tracking capital flows along the industry chain;
- AI products: AI progress in applications and hardware devices.

All positions are full-time, based in Zhongguancun, Beijing. Positions at every level are open; apply according to your experience:

- Experienced hires: editor, staff writer, and managing editor, matched to ability;
- Campus hires: new graduates, including interns with conversion to full-time.

By joining us, you will:

- Stand at the crest of the AI wave: be among the first to encounter the newest AI technologies and products, and build a complete AI knowledge system;
- Master new AI tools: apply new AI technologies and tools in your work to boost efficiency and creativity;
- Build personal influence: by writing exclusive original content ...
The attention mechanism in multimodal LLMs hides a "trap" that takes one formula to fix | Shanghai University × Nankai University
量子位· 2026-01-27 02:33
Core Insights - The article discusses the reliability of attention mechanisms in Vision-Language Models (VLMs), highlighting that attention may not be a trustworthy indicator of semantic importance due to structural biases [2][12] Group 1: Attention Mechanism Issues - Attention is influenced by structural biases, such as position bias, which favors later tokens in a sequence, leading to potential misinterpretation during visual token pruning [3][5] - The phenomenon of "padding attention sink" is identified, where padding areas receive disproportionately high attention, misleading pruning strategies [5][6] Group 2: Proposed Solutions - The research team from Shanghai University suggests a debiasing approach to correct attention biases without introducing new pruning methods or additional training processes [6][12] - By modeling the overall trends of attention biases, the team effectively reduces irrelevant positional factors, enhancing the semantic relevance of attention [6][12] Group 3: Experimental Results - The debiasing strategy was integrated as a plug-and-play module into various mainstream attention-based visual token pruning methods, showing consistent performance improvements across multiple tasks [7][10] - Experimental results indicate that the pruning models with the debiasing correction achieved stable performance enhancements, particularly under aggressive token compression conditions [10][12] Group 4: Conclusion - The findings emphasize that attention is not inherently equivalent to semantic importance, and ignoring inherent structural biases can mislead pruning strategies, affecting overall model performance [12]
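The debiasing idea above can be illustrated with made-up numbers. This is not the paper's implementation; the sketch models the bias as the mean attention per position across many queries, subtracts that trend, and prunes by the residual:

```python
# Toy illustration of debiasing attention before visual-token pruning.
# Assumption: position bias is estimated as the per-position mean over a
# pool of attention rows; the real method may model the trend differently.

def positional_bias(attn_rows):
    """Mean attention per position across queries: the structural trend."""
    n = len(attn_rows[0])
    return [sum(row[i] for row in attn_rows) / len(attn_rows) for i in range(n)]

def top_k(scores, k):
    """Indices of the k highest-scoring tokens (the ones pruning keeps)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Each row = attention over 4 visual tokens; a built-in bias makes later
# positions score higher regardless of content.
pool = [
    [0.25, 0.20, 0.30, 0.40],
    [0.10, 0.35, 0.30, 0.40],
    [0.10, 0.20, 0.45, 0.40],
    [0.10, 0.20, 0.30, 0.55],
]
query = [0.25, 0.20, 0.30, 0.40]  # its semantic bump is at position 0
bias = positional_bias(pool)
debiased = [s - b for s, b in zip(query, bias)]

print(top_k(query, 1))     # raw attention keeps the position-biased token
print(top_k(debiased, 1))  # debiased attention keeps the informative one
```

Raw attention ranks the last token highest purely because of its position; after subtracting the trend, the token that actually carries extra attention for this query rises to the top, which is the behavior a pruning strategy needs.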
1.177 billion yuan in capital bets on the No.1 new trucking player, as the progressive L2-to-higher-autonomy route is proven out first in commercial vehicles!
量子位· 2026-01-27 02:33
By Jia Haonan, from Aofei Temple | QbitAI (WeChat official account QbitAI)

Even on a hardcore track widely regarded as having high technical barriers and daunting commercialization challenges, some players manage to grow against the cycle.

The investors in this round include 普华资本, ABC Impact (a Temasek-affiliated investment firm), 欣旺达, 前海淏天, 瀚棠置业, 临沂国科, 长兴创强基金, 山东国控资本, 联想创投, 大湾区基金, 光跃投资, and 红山基金. With state, foreign, and industrial capital all on board, they seized the window to catch the last ride before DeepWay深向 goes public.

An ever-expanding circle of backers is the norm at DeepWay深向: over the past five years it raised five tranches in its Series A and three in its Series B. Publicly disclosed funding totals 1.98 billion yuan; adding this round's 1.177 billion brings the total past 3 billion. That is one measure of how sought-after this autonomous-trucking company is.

In five years, DeepWay深向 has reached annual revenue of several billion yuan by selling new-energy heavy trucks, and it is still accelerating: a single quarter of deliveries in 2025 exceeded all of 2024. But what investors are betting on is more than vehicle sales, since that logic alone cannot support DeepWay深向's valuation and potential as the "first autonomous-trucking stock" operating on open public roads. Defining and building new-energy heavy trucks from the ground up, once the only commercial-vehicle player authorized to use Baidu Apollo technology, with the full three-electric stack developed in-house... Seeing through Dee ...
The "extra-large" Qwen3 reasoning model that swept SOTA charts while still half-finished is now officially live
量子位· 2026-01-26 15:30
Core Viewpoint - The article highlights the launch of Qwen3-Max-Thinking by Alibaba Qwen, which has achieved state-of-the-art (SOTA) performance in various benchmark tests, surpassing leading models like GPT-5.2-Thinking and Claude-Opus-4.5 in multiple categories [1][2]. Group 1: Model Performance - Qwen3-Max-Thinking has demonstrated superior performance in 19 authoritative benchmark tests, achieving scores that match or exceed those of top closed-source models [1]. - In the MMLU-Pro benchmark, Qwen3-Max-Thinking scored 85.7, while GPT-5.2-Thinking scored 87.4, and Claude-Opus-4.5 scored 89.5 [2]. - The model's reasoning capabilities were highlighted, achieving a score of 91.5 in the IMO-AnswerBench, the highest among competitors [31]. Group 2: Technical Innovations - Qwen3-Max-Thinking incorporates two key innovations: adaptive tool invocation and test-time scaling, which significantly enhance its reasoning performance and native agent capabilities [3][19]. - The adaptive tool invocation allows the model to autonomously select and utilize built-in functions such as search and code interpreters during interactions, improving efficiency [22][24]. - Test-time scaling allocates additional computational resources during the reasoning phase, leading to improved performance without unnecessary redundancy [27][30]. Group 3: Market Impact and Adoption - The article notes that Chinese open-source AI models have gained significant traction, with a 17.1% adoption rate in global model downloads, surpassing the U.S. at 15.8% [36]. - Alibaba's Qwen series has achieved over 10 billion downloads, averaging 1.1 million downloads per day, establishing itself as a new benchmark in the global AI open-source community [39]. - The integration of Qwen models into Alibaba's ecosystem, including platforms like Taobao and Alipay, indicates a strategic focus on combining top-tier model capabilities with practical applications [42][43].
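The adaptive tool-invocation loop described above can be sketched generically. This is not Qwen's actual mechanism or API; the "model" below is a scripted stand-in that decides at each step whether to answer directly or call a built-in tool, with the loop feeding tool results back:

```python
# Generic agent-loop sketch for adaptive tool invocation. Assumptions:
# fake_model, TOOLS, and the message format are all hypothetical.

def fake_model(history):
    """Scripted policy: compute 6 * 7 via the code tool, then answer."""
    if not any(msg.startswith("tool:") for msg in history):
        return {"type": "tool_call", "tool": "python", "arg": "6 * 7"}
    result = history[-1].split(":", 1)[1]
    return {"type": "answer", "text": f"The result is {result}"}

TOOLS = {"python": lambda arg: str(eval(arg))}  # toy code interpreter

def agent_loop(model, max_steps=5):
    history = ["user: what is 6 * 7?"]
    for _ in range(max_steps):
        step = model(history)
        if step["type"] == "answer":
            return step["text"]
        history.append("tool:" + TOOLS[step["tool"]](step["arg"]))
    raise RuntimeError("no answer within budget")

print(agent_loop(fake_model))  # -> "The result is 42"
```

The efficiency gain the article attributes to adaptive invocation comes from the model choosing when a tool call is worth the extra step, rather than calling tools on every turn.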
The chips behind Luckin Coffee can no longer stay hidden
量子位· 2026-01-26 10:14
Core Viewpoint - The article discusses the significant role of edge AI and the importance of chips in the operations of Luckin Coffee, revealing the partnership with a newly listed domestic GPU company, TianShu ZhiXin [8][35]. Group 1: Edge AI and Chip Importance - Luckin Coffee utilizes edge AI to monitor various operational aspects such as order recognition, material status, and equipment performance, ensuring real-time data synchronization for quality control and decision-making [3][4]. - The chips are crucial for deploying edge AI, requiring proximity for computation, quick response times, strong stability, and cost control [6][7]. Group 2: TianShu ZhiXin and Product Launch - TianShu ZhiXin recently launched four edge computing products under the Tongyang series, which are already in use by Luckin Coffee [9][10]. - The Tongyang series includes four products: TY1000, TY1100, TY1100_NX, and TY1200, designed to cater to various computational needs and deployment scenarios [16][29]. Group 3: Product Specifications and Performance - The TY1000 model is compact yet powerful, offering nearly 200T of dense computing power and outperforming NVIDIA's AGX Orin in several benchmarks [18][20]. - The TY1100 features a 12-core ARM v9 architecture, suitable for complex scenarios requiring high general computing and AI inference [22][24]. - The TY1100_NX is designed for users sensitive to memory capacity and cost, while the TY1200 targets end-users looking to integrate AI capabilities directly into devices [26][28]. Group 4: Market Position and Ambitions - TianShu ZhiXin aims to surpass NVIDIA, with a roadmap indicating plans to release architectures that outperform NVIDIA's offerings by 2025 and beyond [36][39]. - The company has already delivered over 52,000 chips and serves over 300 clients, demonstrating significant commercial traction and application in various industries [49][51]. 
Group 5: Broader Implications - The integration of domestic computing power into various sectors signifies a shift in the industry, where chips are becoming essential components of business operations rather than mere specifications [54].