CLIP
DeepSeek open-sources a brand-new OCR model! It drops CLIP for a lightweight Qwen model and rivals Gemini-3 Pro
量子位· 2026-01-27 08:32
Henry | QbitAI

DeepSeek has just open-sourced a new OCR model, DeepSeek-OCR 2, focused on accurately converting PDF documents into Markdown. Compared with the first-generation model released on October 20 last year, the core breakthrough of DeepSeek-OCR 2 is that it abandons the rigid "raster scan" logic of traditional models and instead dynamically reorders visual tokens according to image semantics. To do this, DeepSeek-OCR 2 drops the CLIP component used in its predecessor and builds DeepEncoder V2 on a lightweight language model (Qwen2-0.5B), introducing "causal reasoning" capability already at the visual encoding stage. The change mimics the causal visual flow of a human reading a document, letting the visual tokens be intelligently reordered before the LLM interprets the content. In terms of performance, DeepSeek-OCR 2 matches Gemini-3 Pro while relying only on lightweight models: on the OmniDocBench v1.5 benchmark it improves by 3.73% and makes notable progress in visual reading logic.
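The article describes the reordering mechanism only at a high level, so the following is a minimal, hypothetical sketch of semantics-driven token reordering in general: a small causal transformer scores the raster-ordered patch tokens, and a permutation derived from those scores replaces the fixed raster order before the tokens reach the decoder LLM. Module names, dimensions, and the scoring scheme are assumptions for illustration, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class TokenReorderer(nn.Module):
    """Toy stand-in for a DeepEncoder-V2-style reordering stage.

    A small causal transformer reads the raster-ordered patch tokens and
    predicts a scalar "reading priority" per token; sorting by that score
    replaces the fixed raster order with a semantics-driven one.
    """
    def __init__(self, dim: int = 256, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) in raster order
        n = patch_tokens.size(1)
        # Causal mask: each token's score depends only on earlier tokens,
        # mimicking a left-to-right pass over the page.
        causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.encoder(patch_tokens, mask=causal_mask)
        scores = self.score_head(h).squeeze(-1)          # (batch, num_patches)
        order = scores.argsort(dim=-1, descending=True)  # predicted reading order
        # Gather tokens into that order before handing them to the decoder LLM.
        return torch.gather(
            patch_tokens, 1,
            order.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1)))

reordered = TokenReorderer()(torch.randn(1, 196, 256))  # e.g. a 14x14 patch grid
```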
Zero-shot & few-shot sweep across 12 industrial and medical datasets: new Siemens × Tencent Youtu research pinpoints defects with new SOTA detection accuracy | AAAI 2026
量子位· 2026-01-19 03:48
Core Insights
- The article discusses the development of AdaptCLIP, a universal visual anomaly detection framework that aims to improve performance in industrial quality inspection and medical imaging by leveraging the capabilities of the CLIP model while addressing its limitations in zero-shot and few-shot scenarios [2][4].

Group 1: Challenges in Anomaly Detection
- Traditional models for defect detection require extensive labeled data, making them less effective in real-world scenarios where data is scarce [1][3].
- The core challenge in anomaly detection is the need for models to generalize across domains while accurately identifying subtle anomalies with minimal target domain data [3][4].

Group 2: AdaptCLIP Framework
- AdaptCLIP introduces a lightweight adaptation approach by adding three adapters to the CLIP model without altering its core structure, enabling it to perform both image-level anomaly classification and pixel-level anomaly segmentation [5][6].
- The framework employs an alternating learning strategy, optimizing visual and textual representations separately to enhance performance in zero-shot anomaly detection [20][21].

Group 3: Key Innovations
- The visual adapter fine-tunes CLIP's output tokens to better align with the anomaly detection task, significantly improving pixel-level localization capabilities [15][18].
- The text adapter eliminates the need for manually designed prompts by learning optimized embeddings for "normal" and "anomalous" classes, thus reducing dependency on prompt engineering (a minimal sketch of the adapter idea follows this summary) [16][18].

Group 4: Experimental Results
- AdaptCLIP achieved an average image-level AUROC of 86.2% across multiple industrial datasets in zero-shot scenarios, outperforming existing methods [31].
- In medical imaging tasks, AdaptCLIP demonstrated an average pixel-level AUPR of 48.7% and an average image-level AUROC of 90.7%, indicating superior performance compared to other approaches [31][32].

Group 5: Efficiency and Scalability
- The model introduces approximately 0.6 million additional trainable parameters under zero-shot conditions, significantly lower than competing methods that can exceed 10.7 million parameters [32][37].
- AdaptCLIP maintains a reasonable inference time of about 162 ms per image at a resolution of 518x518, balancing detection accuracy with deployment efficiency [32][37].
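The adapters are only described in prose above. As a rough, hypothetical illustration of the general shape of such lightweight adaptation, the sketch below keeps CLIP frozen (its patch features are taken as input), adds a small residual visual adapter, and replaces hand-written prompts with two learned class embeddings whose patch-level similarities form an anomaly map. All module names, sizes, and the residual design are assumptions, not the AdaptCLIP release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnomalyAdapters(nn.Module):
    """Toy AdaptCLIP-style head: CLIP itself stays frozen, only these
    lightweight pieces are trained.

    * visual adapter: a small residual MLP that reprojects CLIP patch tokens
      toward the anomaly-detection task;
    * text adapter: two freely learned class embeddings that replace
      hand-written "a photo of a normal/damaged object" prompts.
    """
    def __init__(self, clip_dim: int = 512):
        super().__init__()
        self.visual_adapter = nn.Sequential(
            nn.Linear(clip_dim, clip_dim // 4), nn.GELU(),
            nn.Linear(clip_dim // 4, clip_dim))
        # Learned stand-ins for the "normal" / "anomalous" text embeddings.
        self.class_embed = nn.Parameter(torch.randn(2, clip_dim) * 0.02)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, clip_dim) from a frozen CLIP ViT.
        feats = patch_feats + self.visual_adapter(patch_feats)   # residual update
        feats = F.normalize(feats, dim=-1)
        text = F.normalize(self.class_embed, dim=-1)
        logits = feats @ text.t()                                # (B, N, 2)
        # Softmax over {normal, anomalous}: channel 1 is the per-patch anomaly
        # score, i.e. a coarse pixel-level anomaly map after upsampling.
        return logits.softmax(dim=-1)[..., 1]

scores = AnomalyAdapters()(torch.randn(2, 196, 512))  # (2, 196) patch anomaly scores
```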
They recognize bananas and recognize yellow, yet don't know that bananas are yellow
36Kr· 2026-01-16 07:25
Core Insights
- The research conducted by teams from Peking University and Shanxi Medical University reveals that language significantly influences visual perception and knowledge storage in the brain, particularly in individuals with certain neurological conditions [1][5][10].

Group 1: Visual and Language Interaction
- Individuals with intact visual function but impaired connections between the visual cortex and language areas struggle to identify colors from grayscale images, indicating that language is crucial for extracting visual knowledge [3][4].
- Blind individuals acquire color knowledge primarily through language, as they lack visual experiences, contrasting with sighted individuals who utilize both visual and linguistic systems for color representation [2][9].

Group 2: AI and Cognitive Research
- The study utilized AI models to differentiate the effects of visual and linguistic inputs on perception, demonstrating that language training in AI can mirror human brain activity related to visual processing [7][9].
- The research indicates that language can profoundly affect cognitive processes, challenging the notion that language only influences higher-level cognition and suggesting it also impacts basic sensory perception [10][12].

Group 3: Implications for Cognitive Science
- The findings suggest that language is not merely a communication tool but a powerful system that shapes how humans abstract and organize information, potentially altering sensory experiences [12].
- The interplay between cognitive science and AI research is highlighted, as both fields can inform and enhance understanding of human cognition and perception [12].
Good news for the GPU-poor, says an MIT study: no need to pile up GPUs, just copy the top models' homework
36Kr· 2026-01-09 13:20
Core Insights
- The study from MIT reveals that despite the diverse architectures of AI models, their understanding of matter converges as they become more powerful, suggesting a shared cognitive alignment towards physical truths [1][2][3].

Group 1: Model Performance and Understanding
- The research indicates that as AI models improve in predicting molecular energy, their cognitive approaches become increasingly similar, demonstrating a phenomenon known as representation alignment (see the sketch after this summary) [3][5].
- High-performance models, regardless of their structural differences, compress their feature space to capture essential physical information, indicating a convergence in understanding [5][6].

Group 2: Cross-Architecture Alignment
- The study highlights that models trained on different modalities, such as text and images, also show a tendency to align in their understanding of concepts, exemplified by the representation of "cats" [9][14].
- This alignment suggests that powerful models, regardless of their input type, gravitate towards a unified internal representation of reality [14].

Group 3: Implications for AI Development
- The findings challenge the necessity of expensive computational resources for training large models, advocating for model distillation where smaller models can mimic the cognitive processes of larger, high-performance models [18][20].
- The research emphasizes that the future of scientific AI will focus on achieving convergence in understanding rather than merely increasing model complexity, leading to more efficient and innovative AI solutions [22][24][25].
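The summary states the representation-alignment finding in words only. As a concrete illustration of how alignment between two models is commonly quantified, here is a self-contained sketch of linear CKA (a standard metric; the specific metric used in the MIT study is not given above, so treat this as a generic example on synthetic features).

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two representation matrices.

    x, y: (n_samples, dim_x) and (n_samples, dim_y) features produced by two
    different models on the *same* inputs. Returns a value in [0, 1]; values
    near 1 mean the two models encode the inputs with essentially the same
    geometry, the kind of convergence the study reports for strong models.
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(x.T @ y, "fro") ** 2
    return float(hsic / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")))

# Two hypothetical models evaluated on the same 1,000 inputs.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(1000, 256))            # model A features
q, _ = np.linalg.qr(rng.normal(size=(256, 256)))  # a random rotation
feats_b = feats_a @ q                             # model B: same geometry, different basis
print(round(linear_cka(feats_a, feats_b), 3))     # 1.0: perfectly aligned representations
```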
Why are agents always fierce as dragons in the demo and limp as worms in production?
量子位· 2025-12-22 09:30
Core Viewpoint
- The article discusses the limitations of AI agents in real-world applications compared to their impressive demonstrations, emphasizing that adaptability is a key factor for improvement [1].

Summary by Sections

Definition and Functionality of Agents
- Agents are defined as AI systems that can plan, utilize tools (such as search engines and databases), and remember information to complete complex tasks independently [3].

Adaptability Framework
- The core bottleneck in current agent systems is adaptability, specifically how models adjust their behavior based on feedback signals [6].
- A 2x2 classification framework is proposed to categorize existing adaptation methods into four paradigms based on two dimensions: who is optimized (the agent or the tools) and where the feedback signal comes from (tool execution results or agent output evaluations); a minimal sketch of this grid follows the summary [7][8][9].

Four Paradigms of Adaptation
- **A1 Paradigm**: Agents learn from feedback based on tool execution, such as whether code runs successfully [10].
- **A2 Paradigm**: Uses the agent's final output as the optimization signal, exemplified by models like DeepSeek-R1 that train reasoning capabilities through reinforcement learning [11].
- **T1 Paradigm**: Tools are pre-trained independently and then called by the agent, allowing for plug-and-play functionality [12].
- **T2 Paradigm**: Tools optimize themselves based on the agent's output, creating a symbiotic relationship [13].

Benefits of Classification
- This classification helps developers avoid trial and error when improving AI capabilities, allowing for targeted adaptations based on specific needs [15].
- It also clarifies trade-offs: modifying AI (A1/A2) is flexible but costly, while modifying tools (T1/T2) is cheaper but limited by the AI's inherent capabilities [16].

Key Findings on Data Efficiency
- The T2 paradigm demonstrates significantly higher data efficiency compared to the A2 paradigm. For instance, Search-R1 using A2 requires approximately 170,000 training samples, while T2 only needs 2,400 samples to achieve comparable results [18][19][20].

Frontiers in Adaptability Research
- The article identifies four cutting-edge directions for agent adaptability research:
  - **Co-Adaptation**: Aims for agents and tools to optimize together within the same learning cycle, presenting challenges in credit assignment [21].
  - **Continual Adaptation**: Addresses the need for agents to continuously learn new skills without forgetting old ones in a changing environment [23].
  - **Safe Adaptation**: Highlights concerns that large models may erode safety measures established during supervised fine-tuning, making them more vulnerable to attacks [25].
  - **Efficient Adaptation**: Focuses on resource-constrained scenarios, discussing techniques like LoRA and FlashRL for efficient learning [27].

Additional Resources
- The article mentions that a GitHub repository has been opened to continuously collect related papers and resources, serving as a guide for developers building agent systems [29].
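As a small, purely illustrative encoding of the survey's 2x2 grid (who is optimized × where the feedback comes from), the mapping below restates the four paradigms in code; the paradigm names are the article's, everything else is our own stand-in.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AdaptationSetting:
    """One cell of the 2x2 grid described in the article."""
    optimized: str   # "agent" or "tool"
    feedback: str    # "tool_execution" or "output_evaluation"

PARADIGMS = {
    AdaptationSetting("agent", "tool_execution"): "A1",    # e.g. reward = "did the code run?"
    AdaptationSetting("agent", "output_evaluation"): "A2", # e.g. RL on final answers (DeepSeek-R1 style)
    AdaptationSetting("tool", "tool_execution"): "T1",     # tools pre-trained, then plugged in
    AdaptationSetting("tool", "output_evaluation"): "T2",  # tools tuned against the agent's outputs
}

def classify(optimized: str, feedback: str) -> str:
    """Look up which paradigm a given adaptation setup falls into."""
    return PARADIGMS[AdaptationSetting(optimized, feedback)]

print(classify("tool", "output_evaluation"))  # -> "T2"
```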
Over 1,100 models take different paths to the same destination, pointing to a "universal subspace": another win for Plato?
机器之心· 2025-12-14 04:53
Core Insights
- The importance of model architecture may exceed previous understanding, as a study from Johns Hopkins University reveals that over 1,100 different neural networks converge to a shared low-dimensional subspace, suggesting a "prior" mathematical structure that all neural networks approach [1][2][14].

Group 1: Findings and Implications
- This discovery helps explain several phenomena, such as why over-parameterized models can generalize, why different initializations lead to similar representations, and the effectiveness of techniques like LoRA and weight sharing [2][14].
- The research provides empirical evidence for the existence of a universal weight subspace hypothesis, indicating that all models may converge to a common subspace, which could limit diversity and introduce inherent biases [8][14][33].
- The study suggests that shared subspaces could enable large-scale model compression, rapid adaptation to new tasks, and insights into generalization boundaries and optimization landscapes [14][15].

Group 2: Methodology and Results
- The authors focused on LoRA adapters and observed the emergence of a universal subspace in the Mistral-7B model, extending the analysis to 500 Vision Transformers and 50 LLaMA3-8B models, all trained on different datasets and initializations [11][15].
- The analysis revealed that a unique shared low-rank structure exists across various tasks, with most information concentrated in 16 or fewer subspace directions, supporting the practical utility of the universal subspace (a sketch of this kind of analysis follows the summary) [19][22].
- The universal subspace model demonstrated a 19-fold improvement in memory efficiency, as it eliminated the need to store all individual LoRA models [23].

Group 3: Theoretical Considerations
- The authors propose several theoretical factors contributing to the emergence of universal subspaces, including neural networks' preference for low-frequency functions, strong inductive biases imposed by modern architectures, and the universal nature of gradient-based optimization methods [36][37].
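The subspace analysis is described only verbally above. On synthetic data, the sketch below illustrates the kind of computation involved: stack many LoRA weight updates, take an SVD, and keep the top shared directions, so each adapter can be stored as a handful of coefficients. It is an illustrative reconstruction under assumed shapes, not the authors' released code.

```python
import numpy as np

def shared_lora_subspace(lora_updates: list[np.ndarray], k: int = 16) -> np.ndarray:
    """Estimate a shared low-dimensional weight subspace from many LoRA adapters.

    lora_updates: weight updates (delta_W = B @ A) from models fine-tuned on
    different tasks. An SVD of the stacked, flattened updates yields the
    directions most of them share; the paper reports that ~16 directions
    already capture most of the information.
    """
    stacked = np.stack([u.ravel() for u in lora_updates])   # (n_models, n_params)
    stacked = stacked - stacked.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    return vt[:k]                                            # (k, n_params) orthonormal basis

def project(update: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Store only k coefficients per model and reconstruct the full update."""
    coeffs = basis @ update.ravel()
    return (basis.T @ coeffs).reshape(update.shape)

# Synthetic check: 50 adapters that secretly live in an 8-dimensional subspace.
rng = np.random.default_rng(0)
basis_true = rng.normal(size=(8, 64 * 16))
updates = [(rng.normal(size=8) @ basis_true).reshape(64, 16) for _ in range(50)]
B = shared_lora_subspace(updates, k=8)
print(np.allclose(project(updates[0], B), updates[0], atol=1e-6))  # True: 8 numbers suffice
```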
Major breakthrough in long-text retrieval: new model from a China Unicom team lifts accuracy by nearly 20%
Sou Hu Cai Jing· 2025-12-02 20:15
Core Viewpoint
- HiMo-CLIP is a new AI model developed by China Unicom's Data Science and Artificial Intelligence Research Institute, designed to improve the accuracy of image retrieval by automatically identifying key information in complex descriptions, addressing the common issue of "too much detail leading to errors" in AI processing [2][7][21].

Group 1: Model Features
- HiMo-CLIP utilizes a specialized module called HiDe, which employs statistical methods to extract the most distinguishing features from similar descriptions, enhancing the model's ability to focus on key attributes [7][8].
- The model achieves an accuracy rate of 89.3%, significantly improving upon previous methods that relied on fixed templates or manual annotations [8].
- HiMo-CLIP's implementation is efficient, requiring minimal hardware resources and having only about a 7% impact on inference speed on A100 GPUs, making it accessible for standard servers [10][11].

Group 2: Performance Metrics
- The model incorporates a dual alignment mechanism known as MoLo loss, which ensures that both the overall semantic meaning and core feature matching are prioritized, thus preventing the "more detail, more errors" phenomenon (a rough sketch of such a dual objective follows the summary) [11][13].
- In tests on the MSCOCO-Long dataset, HiMo-CLIP's mean Average Precision (mAP) improved by nearly 20% compared to the previous Long-CLIP model, while maintaining 98.3% of its original performance on short-text datasets like Flickr30K [13].

Group 3: Practical Applications
- HiMo-CLIP has already been applied in real-world scenarios, such as enhancing product search functionalities on JD.com, where handling complex user descriptions led to a 27% increase in search conversion rates [14][15].
- The model is also being explored in the autonomous driving sector to interpret complex road descriptions, improving environmental recognition for vehicle systems [18].

Group 4: Future Developments
- The team plans to release a multilingual version of HiMo-CLIP by Q3 2026, aiming to handle specialized terminology and foreign-language descriptions more effectively [21].
- The success of HiMo-CLIP highlights the importance of simulating human cognitive logic in AI models, suggesting a potential new direction for multimodal intelligence development through structured semantic spaces [21].
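Neither HiDe nor the MoLo loss is specified in detail above, so the following is only a guess at the general shape of such a dual objective: a standard global contrastive term plus a second term that aligns images with an embedding pooled over the caption's most distinctive tokens. The function name, the weighting `lam`, and the temperature are illustrative assumptions, not HiMo-CLIP's actual formulation.

```python
import torch
import torch.nn.functional as F

def dual_alignment_loss(img: torch.Tensor,
                        txt_global: torch.Tensor,
                        txt_key: torch.Tensor,
                        tau: float = 0.07,
                        lam: float = 0.5) -> torch.Tensor:
    """Illustrative MoLo-style objective: global semantics + key-attribute match.

    img:        (B, D) image embeddings
    txt_global: (B, D) whole-caption embeddings
    txt_key:    (B, D) embeddings pooled over the most distinctive caption
                tokens (what a HiDe-like selector would pick statistically)
    """
    def info_nce(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / tau
        target = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, target)

    # Align with the full description *and* with its core attributes, so extra
    # detail in long captions does not drown out the decisive features.
    return info_nce(img, txt_global) + lam * info_nce(img, txt_key)

loss = dual_alignment_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```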
NeurIPS 2025 | In-context meta-learning enables cross-subject brain activity prediction without fine-tuning
机器之心· 2025-11-19 04:07
Core Insights
- The article discusses the development of BraInCoRL, a novel brain encoding model that utilizes meta-learning and in-context learning to predict brain responses to visual stimuli with minimal data requirements [3][32].
- This model addresses the limitations of traditional visual encoding models, which require extensive data collection for each individual, making them costly and difficult to implement in clinical settings [6][32].

Background and Innovation
- The research highlights significant functional differences in the human higher visual cortex among individuals, necessitating the creation of brain encoding models that can effectively represent these differences [2][6].
- BraInCoRL allows for the prediction of brain responses using only a small number of example images and their corresponding brain activity data, eliminating the need for model fine-tuning [3][32].

Methodology
- The BraInCoRL framework treats each voxel as an independent function mapping visual stimuli to neural responses, leveraging meta-learning and in-context learning to enhance data efficiency and generalization (a minimal sketch of this setup follows the summary) [7][10].
- During training, the model learns shared structures of visual cortex responses from multiple subjects, and during testing, it can generate a subject-specific voxel encoder using just a few image-brain response pairs [11][20].

Experimental Results
- BraInCoRL demonstrates high data efficiency, achieving explained variance comparable to models trained on thousands of images while using only 100 context images [20][22].
- The model shows robust performance across different datasets and scanning protocols, confirming its cross-device and cross-protocol generalization capabilities [22][23].
- Semantic clustering visualizations reveal clear functional organization within the visual cortex, with distinct areas for faces, scenes, and other categories [26][27].

Conclusion
- BraInCoRL introduces in-context learning to computational neuroscience, creating a data-efficient, interpretable, and language-interactive framework for visual cortex encoding [32].
- This innovation significantly lowers the barriers to constructing individualized brain encoding models, paving the way for applications in clinical neuroscience and other data-limited scenarios [32].
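As a rough, hypothetical sketch of the in-context setup described above (not the BraInCoRL code): a transformer receives a few (image-feature, measured-response) pairs as context together with query image features, and predicts the query responses with no per-subject weight update. All dimensions and module choices are assumptions.

```python
import torch
import torch.nn as nn

class InContextVoxelPredictor(nn.Module):
    """Toy in-context encoder: predict a voxel's response to new images from a
    handful of (image feature, measured response) pairs, with no fine-tuning."""
    def __init__(self, img_dim: int = 512, dim: int = 256):
        super().__init__()
        self.embed_pair = nn.Linear(img_dim + 1, dim)   # context token: feature + response
        self.embed_query = nn.Linear(img_dim, dim)      # query token: feature only
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, 1)

    def forward(self, ctx_imgs, ctx_resp, query_imgs):
        # ctx_imgs: (B, C, img_dim), ctx_resp: (B, C), query_imgs: (B, Q, img_dim)
        ctx = self.embed_pair(torch.cat([ctx_imgs, ctx_resp.unsqueeze(-1)], dim=-1))
        qry = self.embed_query(query_imgs)
        h = self.backbone(torch.cat([ctx, qry], dim=1))
        # Read predictions off the query positions only.
        return self.head(h[:, ctx.size(1):]).squeeze(-1)   # (B, Q) predicted responses

model = InContextVoxelPredictor()
# 100 context image-response pairs, 10 query images, for 2 voxels/batches.
pred = model(torch.randn(2, 100, 512), torch.randn(2, 100), torch.randn(2, 10, 512))
```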
NeurIPS 2025 Spotlight | The University of Hong Kong proposes an annotation-free method for enhancing ViT dense representations
机器之心· 2025-11-19 04:07
Core Insights
- The article discusses the introduction of PH-Reg, a novel method for enhancing Vision Transformers (ViTs) by removing artifacts from dense features without requiring data labeling, thus improving model performance in fine-grained tasks [2][6][19].

Group 1: Methodology
- PH-Reg employs a test-time augmentation denoising strategy to eliminate artifacts from the dense features of teacher models, resulting in a student model that outputs artifact-free dense features (a rough sketch of this idea follows the summary) [2][11].
- The self-distillation framework of PH-Reg allows for enhancement of the student model architecture with minimal intrusion, focusing updates on specific components while preserving the core information of the pre-trained ViT model [11][20].
- The method is designed to be plug-and-play, requiring no retraining from scratch and enabling efficient artifact removal from existing pre-trained models like CLIP and DINOv2 [19][22].

Group 2: Experimental Results
- In semantic segmentation tasks across eight benchmark datasets, PH-Reg outperformed mainstream methods such as MaskCLIP and SCLIP on seven datasets, demonstrating its robustness and effectiveness [13][21].
- Specifically, the method achieved a significant improvement of 5.04% in mean Intersection over Union (mIoU) on the VOC21 dataset and 3.64% on the ADE20K dataset for the CLIP model [21].
- The training time for PH-Reg is reduced by over 58.9% compared to traditional methods, with a total training time of 9,000 minutes, significantly less than the 21,908 minutes required for DVT [17][22].

Group 3: Advantages
- PH-Reg's core advantage lies in its independence from gradient-based neural field learning, allowing for a single-stage distillation process that minimizes storage requirements and computational resources [22].
- The method can compute all distillation targets in real time without the need for additional storage space, contrasting with DVT's requirement of 1.4 TB for neural field feature data [22].
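The denoising step is described above only as test-time augmentation averaging. The sketch below shows one way such a target could be built, using horizontal flips that are undone in feature space before averaging so that position-dependent artifacts cancel out; the toy `teacher` function and all specifics are assumptions, not the released PH-Reg code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def denoised_teacher_target(teacher, image: torch.Tensor, n_aug: int = 8) -> torch.Tensor:
    """Average a frozen teacher's dense features over random horizontal flips,
    mapped back to the original frame. The averaged map serves as the
    distillation target for the student."""
    accum = teacher(image)                          # (B, D, H', W') dense feature map
    for _ in range(n_aug - 1):
        flip = bool(torch.rand(1).item() < 0.5)
        aug = torch.flip(image, dims=[3]) if flip else image
        feat = teacher(aug)
        if flip:
            feat = torch.flip(feat, dims=[3])       # undo the flip on the features
        accum = accum + feat
    return accum / n_aug

# Hypothetical frozen teacher: any model mapping (B,3,H,W) -> (B,D,H/14,W/14).
teacher = lambda x: F.avg_pool2d(x.repeat(1, 64, 1, 1), kernel_size=14)
target = denoised_teacher_target(teacher, torch.randn(1, 3, 224, 224))
student_feat = torch.randn_like(target)             # stand-in for the student's dense output
loss = F.mse_loss(student_feat, target)              # distill toward the denoised target
```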
360 open-sources FG-CLIP2: topping 29 global benchmarks
Yang Zi Wan Bao Wang· 2025-11-03 12:17
Core Insights
- The recent launch of 360 Group's open-source visual language alignment model FG-CLIP2 has generated significant attention in the global tech community, marking a breakthrough for China in the AI foundational model sector [1][7].
- FG-CLIP2 outperformed major competitors like Google's SigLIP 2 and Meta's MetaCLIP2 across 29 authoritative benchmark tests, showcasing its advanced capabilities in AI [1][6].

Performance and Innovations
- FG-CLIP2 represents a qualitative leap in fine-grained recognition, addressing long-standing challenges faced by traditional CLIP models in distinguishing subtle object attributes and complex spatial relationships [3][6].
- The model features three fundamental innovations: a hierarchical alignment architecture for macro and micro scene understanding, a dynamic attention mechanism for efficient detail capture, and a bilingual optimization strategy for balanced understanding of Chinese and English (a rough sketch of hierarchical alignment in general follows the summary) [6][7].

Industry Applications
- FG-CLIP2's capabilities extend to various industries, enhancing e-commerce by enabling precise searches based on complex descriptions, thereby improving product recommendation and reducing return rates [7].
- In the field of embodied intelligence, FG-CLIP2 acts as a "smart eye" for robots, allowing them to execute complex tasks in dynamic environments [7].
- The model also supports AIGC content generation, content review, and security monitoring, ensuring accuracy and efficiency across multiple critical scenarios [7].

Strategic Importance
- The open-sourcing of FG-CLIP2 is a strategic move by 360 Group, reinforcing its commitment to building a self-sufficient AI technology ecosystem in China [7].
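The article names the hierarchical alignment architecture without detailing it, so the following is only a speculative sketch of what a hierarchical (global plus region-phrase) image-text score can look like in general; it is not FG-CLIP2's actual objective, and every name and weight here is an assumption.

```python
import torch
import torch.nn.functional as F

def hierarchical_similarity(patch_feats: torch.Tensor,
                            phrase_feats: torch.Tensor,
                            global_img: torch.Tensor,
                            global_txt: torch.Tensor,
                            alpha: float = 0.5) -> torch.Tensor:
    """Illustrative hierarchical image-text score: a coarse whole-image/whole-caption
    term plus a fine term in which every phrase is matched to its best image patch.

    patch_feats:  (N_patch, D) region/patch embeddings of one image
    phrase_feats: (N_phrase, D) embeddings of attribute/relation phrases
    global_img, global_txt: (D,) pooled embeddings
    """
    patch = F.normalize(patch_feats, dim=-1)
    phrase = F.normalize(phrase_feats, dim=-1)
    fine = (phrase @ patch.t()).max(dim=-1).values.mean()     # best patch per phrase
    coarse = F.cosine_similarity(global_img, global_txt, dim=0)
    return alpha * coarse + (1 - alpha) * fine

score = hierarchical_similarity(torch.randn(196, 512), torch.randn(5, 512),
                                torch.randn(512), torch.randn(512))
```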