Surpassing CLIP: Peking University open-sources a fine-grained visual recognition large model, needing only 4 training images per class
36Kr · 2026-02-11 08:03
Core Insights
- The research team led by Professor Peng Yuxin at Peking University has made significant advances in fine-grained visual recognition with multimodal large models; the latest paper has been accepted at ICLR 2026 and the code has been open-sourced [1][19]

Group 1: Fine-Grained Visual Recognition
- The real world exhibits fine-grained characteristics, with objects often containing a rich hierarchy of categories; for example, aircraft are classified into specific models such as the Boeing 707, 717, and 727, and over 500 types of fixed-wing aircraft have been recorded globally [2]
- The Fine-R1 model aims to leverage the extensive knowledge of fine-grained subcategories contained within multimodal large models to achieve fine-grained recognition of visual objects in open domains, overcoming the limitations of traditional methods that focus on a closed set of categories [4]

Group 2: Model Development and Methodology
- The Fine-R1 model employs a two-phase approach:
  1. Chain-of-thought supervised fine-tuning, which simulates human reasoning to build the model's inference capabilities [7]
  2. Triplet enhancement strategy optimization, which improves the model's robustness to intra-class variations and its ability to distinguish between different classes [8]
- The model demonstrates superior performance, achieving higher accuracy in recognizing both seen and unseen subcategories with only four training images per class, surpassing models such as OpenAI's CLIP and Google DeepMind's SigLIP [13][14]

Group 3: Experimental Results
- Experimental results indicate that Fine-R1 outperforms various models in both closed-set and open-set recognition tasks, showcasing its effectiveness in fine-grained visual recognition [14][16]
- The model's gains are attributed primarily to its improved ability to utilize fine-grained subcategory knowledge, rather than to merely optimizing visual representations or increasing knowledge reserves [16]
Surpassing CLIP! Peking University open-sources a fine-grained visual recognition large model, needing only 4 training images per class
量子位· 2026-02-11 01:55
Core Viewpoint
- The article discusses the limitations of current multimodal large models in fine-grained visual recognition tasks and introduces the Fine-R1 model developed by Professor Peng Yuxin's team at Peking University, which significantly improves recognition accuracy with minimal training data [1][2][5]

Group 1: Fine-Grained Visual Recognition Challenges
- Current multimodal large models excel at complex tasks but lag behind their own visual encoders, such as CLIP, in fine-grained visual recognition [1]
- Real-world objects exhibit fine-grained characteristics with numerous subclasses, such as the more than 500 types of fixed-wing aircraft, highlighting the importance of fine-grained recognition in practical applications [3]

Group 2: Fine-R1 Model Overview
- The Fine-R1 model aims to leverage the rich knowledge of fine-grained subclasses and a generative decoding paradigm to overcome the limitations of traditional recognition methods, enabling fine-grained recognition of any visual object in an open domain [5]
- Fine-R1 enhances the model's ability to reason about unseen subclasses using a small number of training images (only 4 per subclass), outperforming models such as OpenAI's CLIP and Google DeepMind's SigLIP [5][15]

Group 3: Model Development Process
- The development of Fine-R1 involves two main steps (a sketch of the triplet objective follows this summary):
  1. Chain-of-thought supervised fine-tuning, which simulates human reasoning to build inference capabilities [7]
  2. Triplet enhancement strategy optimization, which improves robustness to intra-class variations and inter-class distinctions by using positive and negative samples [8][10]

Group 4: Experimental Results
- Fine-R1's performance was evaluated on six authoritative fine-grained image classification datasets, demonstrating superior accuracy on both seen and unseen categories compared to other models [15][17]
- The model's ability to effectively utilize fine-grained subclass knowledge was identified as the primary factor behind its improved recognition accuracy, rather than enhancements in visual representation or knowledge storage [19]

Group 5: Conclusion and Future Work
- The article concludes with the potential of Fine-R1 to excel in fine-grained visual recognition tasks, emphasizing its innovative approach to reasoning and knowledge application [21]
- The research has been accepted at ICLR 2026 and the code is open-sourced for further exploration [2][22]
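The triplet enhancement step is only named in these summaries, not specified. As a rough illustration of how a triplet-style objective over positive and negative samples is typically written, the sketch below pulls an anchor embedding toward a same-subclass positive and away from a confusable negative; the function name, margin, and cosine-distance choice are assumptions, not Fine-R1's published formulation.

```python
import torch
import torch.nn.functional as F

def triplet_enhancement_loss(anchor, positive, negative, margin=0.2):
    """Illustrative triplet objective: pull the anchor toward a positive sample
    of the same subclass and push it away from a negative sample of a
    confusable subclass. Names and margin are assumptions, not Fine-R1's."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)  # intra-class distance
    d_neg = 1.0 - F.cosine_similarity(anchor, negative, dim=-1)  # inter-class distance
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random embeddings standing in for model features
anchor, positive, negative = (torch.randn(8, 512) for _ in range(3))
print(triplet_enhancement_loss(anchor, positive, negative).item())
```

In a fine-grained setting the negative would usually come from a visually similar sibling subclass (e.g. Boeing 717 vs. 727), so the margin specifically penalizes inter-class confusion rather than easy distinctions.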
DeepSeek open-sources a brand-new OCR model! CLIP dropped in favor of a lightweight Qwen model, with performance rivaling Gemini-3 Pro
量子位· 2026-01-27 08:32
Core Insights
- DeepSeek has released a new OCR model, DeepSeek-OCR 2, which focuses on accurately converting PDF documents to Markdown format [1]
- The model's key breakthrough is the dynamic rearrangement of visual tokens based on image semantics, moving away from traditional raster-scanning logic [2][3]
- DeepSeek-OCR 2 achieves performance comparable to Gemini-3 Pro while using a lightweight model [4]

Model Architecture
- DeepSeek-OCR 2 retains the classic architecture of its predecessor, consisting of an encoder and decoder working in tandem [10]
- The encoder, now called DeepEncoder V2, replaces the previous CLIP component with a lightweight language model (Qwen2-0.5B), introducing causal reasoning capabilities [2][13]
- This upgrade allows visual tokens to be intelligently rearranged before they enter the main decoder, simulating human reading logic (see the sketch after this summary) [3][15]

Performance Metrics
- On the OmniDocBench v1.5 benchmark, DeepSeek-OCR 2 achieved a score of 91.09%, a 3.73% improvement over the baseline [5][35]
- The model's document-parsing edit distance improved from 0.085 to 0.057, demonstrating the effectiveness of the visual information rearrangement [36]
- Under a similar token budget (1120), DeepSeek-OCR 2 outperformed Gemini-3 Pro in document-parsing edit distance [37]

Training and Evaluation
- The training process for DeepSeek-OCR 2 follows a three-stage pipeline, focusing on semantic rearrangement and autoregressive inference [31]
- The model was evaluated on a dataset comprising 1,355 pages across various document types, ensuring a comprehensive assessment of its capabilities [33][34]
- The model's design allows for a stable input token count between 256 and 1120, aligning with the visual budget of Gemini-1.5 Pro [27]

Conclusion
- DeepSeek-OCR 2 demonstrates significant advances in OCR technology, validating the use of a language-model architecture as a visual encoder and paving the way for unified omni-modal encoders [39]
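The summary describes "dynamic rearrangement of visual tokens" without saying how it is done. The sketch below illustrates the general idea only: a small contextualizing module scores patch tokens and re-orders them before they reach the decoder. The module structure, sizes, and scoring rule are assumptions and merely stand in for DeepEncoder V2's actual use of Qwen2-0.5B.

```python
import torch
import torch.nn as nn

class TokenRearranger(nn.Module):
    """Illustrative sketch: a small encoder (standing in for the lightweight
    language model) scores visual patch tokens, and the tokens are re-ordered
    by score before entering the decoder. Names, sizes, and the scoring rule
    are assumptions, not DeepSeek-OCR 2 internals."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.scorer = nn.Linear(dim, 1)

    def forward(self, patch_tokens):               # (batch, num_patches, dim)
        h = self.encoder(patch_tokens)              # contextualize patches
        scores = self.scorer(h).squeeze(-1)         # one score per patch
        order = scores.argsort(dim=-1, descending=True)
        idx = order.unsqueeze(-1).expand_as(patch_tokens)
        return torch.gather(patch_tokens, 1, idx)   # semantically re-ordered tokens

tokens = torch.randn(2, 256, 512)                   # toy patch tokens
print(TokenRearranger()(tokens).shape)              # torch.Size([2, 256, 512])
```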
Zero-shot & few-shot sweep across 12 industrial and medical datasets: new Siemens × Tencent YouTu research precisely localizes defects, setting a new SOTA in detection accuracy | AAAI 2026
量子位· 2026-01-19 03:48
Core Insights
- The article discusses the development of AdaptCLIP, a universal visual anomaly detection framework that aims to improve performance in industrial quality inspection and medical imaging by leveraging the capabilities of the CLIP model while addressing its limitations in zero-shot and few-shot scenarios [2][4]

Group 1: Challenges in Anomaly Detection
- Traditional defect-detection models require extensive labeled data, making them less effective in real-world scenarios where data is scarce [1][3]
- The core challenge in anomaly detection is the need for models to generalize across domains while accurately identifying subtle anomalies with minimal target-domain data [3][4]

Group 2: AdaptCLIP Framework
- AdaptCLIP introduces a lightweight adaptation approach by adding three adapters to the CLIP model without altering its core structure, enabling both image-level anomaly classification and pixel-level anomaly segmentation (a sketch of such an adapter follows this summary) [5][6]
- The framework employs an alternating learning strategy, optimizing visual and textual representations separately to enhance zero-shot anomaly detection [20][21]

Group 3: Key Innovations
- The visual adapter fine-tunes CLIP's output tokens to better align with the anomaly detection task, significantly improving pixel-level localization [15][18]
- The text adapter eliminates the need for manually designed prompts by learning optimized embeddings for the "normal" and "anomalous" classes, reducing dependence on prompt engineering [16][18]

Group 4: Experimental Results
- AdaptCLIP achieved an average image-level AUROC of 86.2% across multiple industrial datasets in zero-shot scenarios, outperforming existing methods [31]
- In medical imaging tasks, AdaptCLIP demonstrated an average pixel-level AUPR of 48.7% and an average image-level AUROC of 90.7%, indicating superior performance compared to other approaches [31][32]

Group 5: Efficiency and Scalability
- The model introduces approximately 0.6 million additional trainable parameters under zero-shot conditions, significantly fewer than competing methods that can exceed 10.7 million parameters [32][37]
- AdaptCLIP maintains a reasonable inference time of about 162 ms per image at a resolution of 518x518, balancing detection accuracy with deployment efficiency [32][37]
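The three adapters are mentioned but not specified. As a minimal sketch of the kind of lightweight residual adapter that can sit on top of frozen CLIP output tokens, the following keeps CLIP untouched and adds roughly 0.1M trainable parameters; the bottleneck design and dimensions are assumptions, not AdaptCLIP's actual modules.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Illustrative residual bottleneck adapter applied to frozen CLIP output
    tokens; the structure and sizes are assumptions, not AdaptCLIP's design."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, clip_tokens):                  # (batch, tokens, dim), frozen features
        return clip_tokens + self.up(self.act(self.down(clip_tokens)))

# Only the adapter's ~0.1M parameters would be trained; CLIP itself stays frozen.
adapter = VisualAdapter()
tokens = torch.randn(4, 197, 768)                    # e.g. ViT patch tokens + CLS
adapted = adapter(tokens)
print(sum(p.numel() for p in adapter.parameters()))  # ~99k trainable parameters
```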
They recognize bananas and they recognize yellow, yet they don't know that bananas are yellow
36Kr · 2026-01-16 07:25
Core Insights
- The research conducted by teams from Peking University and Shanxi Medical University reveals that language significantly influences visual perception and knowledge storage in the brain, particularly in individuals with certain neurological conditions [1][5][10]

Group 1: Visual and Language Interaction
- Individuals with intact visual function but impaired connections between the visual cortex and language areas struggle to identify colors from grayscale images, indicating that language is crucial for extracting visual knowledge [3][4]
- Blind individuals acquire color knowledge primarily through language, as they lack visual experiences, in contrast to sighted individuals, who use both visual and linguistic systems to represent color [2][9]

Group 2: AI and Cognitive Research
- The study used AI models to disentangle the effects of visual and linguistic inputs on perception, demonstrating that language training in AI can mirror human brain activity related to visual processing [7][9]
- The research indicates that language can profoundly affect cognitive processes, challenging the notion that language only influences higher-level cognition and suggesting that it also shapes basic sensory perception [10][12]

Group 3: Implications for Cognitive Science
- The findings suggest that language is not merely a communication tool but a powerful system that shapes how humans abstract and organize information, potentially altering sensory experience [12]
- The interplay between cognitive science and AI research is highlighted, as each field can inform and deepen understanding of human cognition and perception [12]
Good news for those without big budgets: MIT research shows you don't need to stack GPUs, copying top models' homework is enough
36Kr · 2026-01-09 13:20
Core Insights
- The study from MIT reveals that despite their diverse architectures, AI models' understanding of matter converges as they become more powerful, suggesting a shared cognitive alignment toward physical truths [1][2][3]

Group 1: Model Performance and Understanding
- The research indicates that as AI models improve at predicting molecular energy, their cognitive approaches become increasingly similar, a phenomenon known as representation alignment (one common way to measure this is sketched after this summary) [3][5]
- High-performance models, regardless of their structural differences, compress their feature space to capture essential physical information, indicating a convergence in understanding [5][6]

Group 2: Cross-Architecture Alignment
- The study highlights that models trained on different modalities, such as text and images, also tend to align in their understanding of concepts, exemplified by the representation of "cats" [9][14]
- This alignment suggests that powerful models, regardless of input type, gravitate toward a unified internal representation of reality [14]

Group 3: Implications for AI Development
- The findings challenge the necessity of expensive computational resources for training large models, advocating model distillation, in which smaller models mimic the cognitive processes of larger, high-performance models [18][20]
- The research emphasizes that the future of scientific AI will focus on achieving convergence in understanding rather than merely increasing model complexity, leading to more efficient and innovative AI solutions [22][24][25]
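"Representation alignment" is reported as a finding without a metric in this summary. One standard way to quantify how similarly two models represent the same inputs is linear centered kernel alignment (CKA), sketched below on toy features; this is a common metric for such comparisons, not necessarily the exact measure used in the MIT study.

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between two feature matrices (samples x dims).
    Values near 1 mean the two models represent the samples similarly."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2                 # ||X^T Y||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

# Toy check: features from "two models" on the same 100 inputs
feats_a = torch.randn(100, 256)
feats_b = feats_a @ torch.randn(256, 128)        # derived from A: noticeably aligned
feats_c = torch.randn(100, 128)                  # independent features: low alignment
print(linear_cka(feats_a, feats_b), linear_cka(feats_a, feats_c))
```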
Why are agents always fierce as a dragon in demos but a worm in practice?
量子位· 2025-12-22 09:30
Core Viewpoint
- The article discusses the limitations of AI agents in real-world applications compared to their impressive demonstrations, emphasizing that adaptability is the key factor for improvement [1]

Summary by Sections

Definition and Functionality of Agents
- Agents are defined as AI systems that can plan, use tools (such as search engines and databases), and remember information to complete complex tasks independently [3]

Adaptability Framework
- The core bottleneck in current agent systems is adaptability, specifically how models adjust their behavior based on feedback signals [6]
- A 2x2 classification framework is proposed to categorize existing adaptation methods into four paradigms along two dimensions: who is optimized (the agent or the tools) and where the feedback signal comes from (tool execution results or evaluations of the agent's output) [7][8][9]

Four Paradigms of Adaptation
- **A1 Paradigm**: Agents learn from feedback based on tool execution, such as whether code runs successfully [10]
- **A2 Paradigm**: Uses the agent's final output as the optimization signal, exemplified by models like DeepSeek-R1 that train reasoning capabilities through reinforcement learning [11]
- **T1 Paradigm**: Tools are pre-trained independently and then called by the agent, allowing for plug-and-play functionality [12]
- **T2 Paradigm**: Tools optimize themselves based on the agent's output, creating a symbiotic relationship [13]

Benefits of Classification
- This classification helps developers avoid trial and error when improving AI capabilities, allowing targeted adaptations based on specific needs [15]
- It also clarifies trade-offs: modifying the AI (A1/A2) is flexible but costly, while modifying the tools (T1/T2) is cheaper but limited by the AI's inherent capabilities [16]

Key Findings on Data Efficiency
- The T2 paradigm demonstrates significantly higher data efficiency than the A2 paradigm; for instance, Search-R1 using A2 requires approximately 170,000 training samples, while T2 needs only 2,400 samples to achieve comparable results [18][19][20]

Frontiers in Adaptability Research
- The article identifies four cutting-edge directions for agent adaptability research:
  - **Co-Adaptation**: Aims for agents and tools to optimize together within the same learning cycle, presenting challenges in credit assignment [21]
  - **Continual Adaptation**: Addresses the need for agents to continuously learn new skills without forgetting old ones in a changing environment [23]
  - **Safe Adaptation**: Highlights concerns that large models may erode safety measures established during supervised fine-tuning, making them more vulnerable to attacks [25]
  - **Efficient Adaptation**: Focuses on resource-constrained scenarios, discussing techniques like LoRA and FlashRL for efficient learning [27]

Additional Resources
- The article mentions that a GitHub repository has been opened to continuously collect related papers and resources, serving as a guide for developers building agent systems [29]
Over 1,100 models arrive at the same destination, pointing to a "universal subspace": has Plato won again?
机器之心· 2025-12-14 04:53
Core Insights
- The importance of model architecture may exceed previous understanding: a study from Johns Hopkins University reveals that over 1,100 different neural networks converge to a shared low-dimensional subspace, suggesting a "prior" mathematical structure that all neural networks approach [1][2][14]

Group 1: Findings and Implications
- This discovery helps explain several phenomena, such as why over-parameterized models can generalize, why different initializations lead to similar representations, and the effectiveness of techniques like LoRA and weight sharing [2][14]
- The research provides empirical evidence for the universal weight subspace hypothesis, indicating that all models may converge to a common subspace, which could limit diversity and introduce inherent biases [8][14][33]
- The study suggests that shared subspaces could enable large-scale model compression, rapid adaptation to new tasks, and insights into generalization boundaries and optimization landscapes [14][15]

Group 2: Methodology and Results
- The authors focused on LoRA adapters and observed the emergence of a universal subspace in the Mistral-7B model, extending the analysis to 500 Vision Transformers and 50 LLaMA3-8B models, all trained on different datasets and initializations (a sketch of how such a shared subspace can be estimated follows this summary) [11][15]
- The analysis revealed a unique shared low-rank structure across various tasks, with most information concentrated in 16 or fewer subspace directions, supporting the practical utility of the universal subspace [19][22]
- The universal subspace model demonstrated a 19-fold improvement in memory efficiency, since it eliminates the need to store every individual LoRA model [23]

Group 3: Theoretical Considerations
- The authors propose several theoretical factors contributing to the emergence of universal subspaces, including neural networks' preference for low-frequency functions, the strong inductive biases imposed by modern architectures, and the universal nature of gradient-based optimization methods [36][37]
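The shared low-rank structure across LoRA adapters is summarized without a procedure. One simple way such a subspace can be estimated is to flatten and stack many adapters' weight updates and take the top singular directions, as sketched below on synthetic data; this is an illustration of the general idea, not the paper's exact pipeline.

```python
import torch

def shared_subspace(delta_ws, k=16):
    """Stack flattened weight updates from many adapters and return the top-k
    shared directions via SVD. delta_ws: list of (out, in) weight-update tensors.
    Function name and k are illustrative choices."""
    stacked = torch.stack([w.flatten() for w in delta_ws])   # (num_models, out*in)
    # Rows are models; the right singular vectors span the shared weight subspace.
    _, s, vt = torch.linalg.svd(stacked, full_matrices=False)
    return vt[:k], s                                          # (k, out*in) basis, spectrum

# Synthetic "adapters": every update is a combination of 4 hidden directions.
basis = torch.randn(4, 64 * 64)
adapters = [(torch.randn(4) @ basis).reshape(64, 64) for _ in range(100)]
directions, spectrum = shared_subspace(adapters, k=16)
print(spectrum[:8] / spectrum.sum())   # energy concentrates in the first 4 directions
```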
A major breakthrough in long-text retrieval: a new model developed by a China Unicom team lifts accuracy by nearly 20%
Sohu Caijing · 2025-12-02 20:15
Core Viewpoint
- HiMo-CLIP is a new AI model developed by China Unicom's Data Science and Artificial Intelligence Research Institute, designed to improve the accuracy of image retrieval by automatically identifying the key information in complex descriptions, addressing the common issue of "too much detail leading to errors" in AI processing [2][7][21]

Group 1: Model Features
- HiMo-CLIP uses a specialized module called HiDe, which employs statistical methods to extract the most distinguishing features from similar descriptions, enhancing the model's ability to focus on key attributes [7][8]
- The model achieves an accuracy rate of 89.3%, significantly improving on previous methods that relied on fixed templates or manual annotations [8]
- HiMo-CLIP's implementation is efficient, requiring minimal hardware resources, with only a 7% increase in inference speed on A100 GPUs, making it accessible for standard servers [10][11]

Group 2: Performance Metrics
- The model incorporates a dual alignment mechanism known as the MoLo loss, which ensures that both overall semantic meaning and core feature matching are prioritized, preventing the "more detail, more errors" phenomenon (one plausible shape of such a loss is sketched after this summary) [11][13]
- On the MSCOCO-Long dataset, HiMo-CLIP's mean Average Precision (mAP) improved by nearly 20% over the previous Long-CLIP model, while retaining 98.3% of its original performance on short-text datasets such as Flickr30K [13]

Group 3: Practical Applications
- HiMo-CLIP has already been applied in real-world scenarios, such as enhancing product search on JD.com, where handling complex user descriptions led to a 27% increase in search conversion rates [14][15]
- The model is also being explored in the autonomous driving sector to interpret complex road descriptions, improving environmental recognition for vehicle systems [18]

Group 4: Future Developments
- The team plans to release a multilingual version of HiMo-CLIP by Q3 2026, aiming to handle specialized terminology and foreign-language descriptions more effectively [21]
- The success of HiMo-CLIP highlights the importance of simulating human cognitive logic in AI models, suggesting a potential new direction for multimodal intelligence development through structured semantic spaces [21]
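HiDe and the MoLo loss are only named here. The sketch below shows one plausible shape for a dual-alignment objective: a standard image-text contrastive term for overall semantics plus a term matching the image to an embedding of extracted key attributes. All names, the weighting, and the temperature are assumptions rather than HiMo-CLIP's published loss.

```python
import torch
import torch.nn.functional as F

def dual_alignment_loss(img_emb, text_emb, key_attr_emb, alpha=0.5, temp=0.07):
    """Illustrative dual alignment: a standard image-text contrastive term
    (global semantics) plus an image-to-key-attribute matching term, so long
    descriptions are not dominated by secondary details. Names, weighting,
    and temperature are assumptions, not HiMo-CLIP's published loss."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    key = F.normalize(key_attr_emb, dim=-1)
    logits = img @ txt.T / temp                          # (batch, batch) similarity
    labels = torch.arange(img.size(0))
    global_term = F.cross_entropy(logits, labels)        # whole-description alignment
    local_term = (1.0 - (img * key).sum(dim=-1)).mean()  # key-attribute alignment
    return global_term + alpha * local_term

img, txt, key = (torch.randn(16, 512) for _ in range(3))
print(dual_alignment_loss(img, txt, key).item())
```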
NeurIPS 2025 | In-context meta-learning enables cross-subject brain activity prediction without fine-tuning
机器之心· 2025-11-19 04:07
Core Insights
- The article discusses the development of BraInCoRL, a novel brain encoding model that uses meta-learning and in-context learning to predict brain responses to visual stimuli with minimal data requirements [3][32]
- This model addresses the limitations of traditional visual encoding models, which require extensive data collection for each individual, making them costly and difficult to deploy in clinical settings [6][32]

Background and Innovation
- The research highlights significant functional differences in the human higher visual cortex across individuals, necessitating brain encoding models that can effectively represent these differences [2][6]
- BraInCoRL allows brain responses to be predicted from only a small number of example images and their corresponding brain activity data, eliminating the need for model fine-tuning [3][32]

Methodology
- The BraInCoRL framework treats each voxel as an independent function mapping visual stimuli to neural responses, leveraging meta-learning and in-context learning to enhance data efficiency and generalization [7][10]
- During training, the model learns shared structures of visual cortex responses from multiple subjects; during testing, it can generate a subject-specific voxel encoder from just a few image-brain response pairs (a sketch of this in-context interface follows this summary) [11][20]

Experimental Results
- BraInCoRL demonstrates high data efficiency, achieving variance explained comparable to models trained on thousands of images while using only 100 context images [20][22]
- The model shows robust performance across different datasets and scanning protocols, confirming its cross-device and cross-protocol generalization [22][23]
- Semantic clustering visualizations reveal clear functional organization within the visual cortex, with distinct areas for faces, scenes, and other categories [26][27]

Conclusion
- BraInCoRL introduces in-context learning to computational neuroscience, creating a data-efficient, interpretable, and language-interactive framework for visual cortex encoding [32]
- This innovation significantly lowers the barriers to building individualized brain encoding models, paving the way for applications in clinical neuroscience and other data-limited scenarios [32]
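The in-context interface is described only at a high level. The sketch below shows the kind of model that could implement it: a transformer that consumes a few (image feature, voxel response) context pairs plus a query image feature and predicts the query response, with no gradient updates at test time. The architecture, sizes, and token layout are assumptions, not BraInCoRL's actual design.

```python
import torch
import torch.nn as nn

class InContextVoxelEncoder(nn.Module):
    """Illustrative in-context encoder: context (image feature, voxel response)
    pairs and one query image feature go in, the predicted query response comes
    out. Sizes and structure are assumptions, not BraInCoRL's architecture."""
    def __init__(self, feat_dim=512, dim=256, heads=8, layers=4):
        super().__init__()
        self.pair_proj = nn.Linear(feat_dim + 1, dim)     # embed (feature, response) pairs
        self.query_proj = nn.Linear(feat_dim, dim)        # embed the query feature alone
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, 1)

    def forward(self, ctx_feats, ctx_resp, query_feat):
        ctx = self.pair_proj(torch.cat([ctx_feats, ctx_resp.unsqueeze(-1)], dim=-1))
        qry = self.query_proj(query_feat).unsqueeze(1)    # (batch, 1, dim)
        tokens = torch.cat([ctx, qry], dim=1)             # context pairs + query token
        out = self.backbone(tokens)
        return self.head(out[:, -1, :]).squeeze(-1)       # prediction at the query token

# Toy usage: 100 context image/response pairs and one query image per voxel batch
model = InContextVoxelEncoder()
pred = model(torch.randn(8, 100, 512), torch.randn(8, 100), torch.randn(8, 512))
print(pred.shape)  # torch.Size([8])
```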