Fine-Grained Visual Recognition
Surpassing CLIP: Peking University open-sources a large model for fine-grained visual recognition that needs only 4 training images per class
36Kr · 2026-02-11 08:03
Core Insights
- The research team led by Professor Peng Yuxin at Peking University has made significant advances in fine-grained visual recognition with multimodal large models; their latest paper has been accepted at ICLR 2026 and the code is open-sourced [1][19].

Group 1: Fine-Grained Visual Recognition
- The real world is fine-grained: objects often form rich category hierarchies. Aircraft, for example, break down into specific models such as the Boeing 707, 717, and 727, and over 500 types of fixed-wing aircraft are recorded globally [2].
- The Fine-R1 model aims to exploit the extensive knowledge of fine-grained subcategories stored in multimodal large models to recognize visual objects in open domains, overcoming the limitation of traditional methods to a closed set of categories [4].

Group 2: Model Development and Methodology
- Fine-R1 is trained in two phases:
  1. Chain-of-thought supervised fine-tuning, which simulates human reasoning to build the model's inference capability [7].
  2. Triplet-enhancement strategy optimization, which improves robustness to intra-class variation and the ability to distinguish between classes [8].
- With only four training images per class, the model achieves higher accuracy on both seen and unseen subcategories, surpassing models such as OpenAI's CLIP and Google DeepMind's SigLIP [13][14].

Group 3: Experimental Results
- Experiments show that Fine-R1 outperforms a range of models on both closed-set and open-set recognition tasks, demonstrating its effectiveness in fine-grained visual recognition [14][16].
- The gains are attributed primarily to better use of fine-grained subcategory knowledge, rather than to improved visual representations or a larger knowledge reserve [16].
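The first training phase described above, chain-of-thought supervised fine-tuning, can be illustrated with a minimal sketch of what one training record might look like. The field names, prompt wording, and coarse-to-fine reasoning steps below are illustrative assumptions for this sketch; the article does not specify Fine-R1's actual data format.

```python
# Hedged sketch of a chain-of-thought SFT sample for fine-grained
# recognition: the target response reasons from a coarse category,
# through discriminative attributes, to a fine-grained subclass.
# All names here (build_cot_sample, field keys) are hypothetical.

def build_cot_sample(image_path, coarse_class, attributes, subclass):
    """Assemble one supervised fine-tuning record whose target text
    walks through human-like coarse-to-fine reasoning."""
    reasoning = (
        f"The object is a {coarse_class}. "
        + " ".join(f"It has {a}." for a in attributes)
        + f" These cues indicate the subclass is {subclass}."
    )
    return {
        "image": image_path,
        "prompt": f"What is the fine-grained subclass of this {coarse_class}?",
        "response": reasoning,
        "label": subclass,
    }

sample = build_cot_sample(
    "images/aircraft_001.jpg",          # hypothetical path
    "fixed-wing aircraft",
    ["four underwing engines", "a low-mounted swept wing"],
    "Boeing 707",
)
print(sample["label"])  # Boeing 707
```

Fine-tuning on records of this shape is one plausible way to make a multimodal model expose its latent subcategory knowledge as explicit reasoning steps rather than a bare class name.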
Surpassing CLIP! Peking University open-sources a large model for fine-grained visual recognition that needs only 4 training images per class
QbitAI (量子位) · 2026-02-11 01:55
Core Viewpoint
- The article discusses the limitations of current multimodal large models on fine-grained visual recognition tasks and introduces Fine-R1, developed by Professor Peng Yuxin's team at Peking University, which significantly improves recognition accuracy with minimal training data [1][2][5].

Group 1: Fine-Grained Visual Recognition Challenges
- Current multimodal large models excel at complex tasks but lag behind their own visual encoders, such as CLIP, on fine-grained visual recognition [1].
- Real-world objects are fine-grained, with numerous subclasses (e.g., over 500 types of fixed-wing aircraft), which makes fine-grained recognition important in practical applications [3].

Group 2: Fine-R1 Model Overview
- Fine-R1 leverages the rich fine-grained subclass knowledge of multimodal large models and a generative decoding paradigm to overcome the limitations of traditional recognition methods, enabling fine-grained recognition of any visual object in an open domain [5].
- Fine-R1 strengthens the model's ability to reason about unseen subclasses from a small number of training images (only 4 per subclass), outperforming models such as OpenAI's CLIP and Google DeepMind's SigLIP [5][15].

Group 3: Model Development Process
- Fine-R1 is built in two main steps:
  1. Chain-of-thought supervised fine-tuning, which simulates human reasoning to build inference capability [7].
  2. Triplet-enhancement strategy optimization, which uses positive and negative samples to improve robustness to intra-class variation and sharpen inter-class distinctions [8][10].

Group 4: Experimental Results
- Fine-R1 was evaluated on six authoritative fine-grained image classification datasets and achieved superior accuracy on both seen and unseen categories compared with other models [15][17].
- Its improved recognition accuracy is attributed primarily to more effective use of fine-grained subclass knowledge, rather than to enhancements in visual representation or knowledge storage [19].

Group 5: Conclusion and Future Work
- The article concludes that Fine-R1 has strong potential in fine-grained visual recognition tasks, emphasizing its innovative approach to reasoning and knowledge application [21].
- The research has been accepted at ICLR 2026 and the code is open-sourced for further exploration [2][22].
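The triplet-enhancement step both write-ups describe, which uses positive and negative samples to tighten intra-class variation and widen inter-class gaps, is commonly realized with a triplet margin objective. The sketch below shows that generic objective in plain Python; the embedding values, margin, and function names are illustrative assumptions, not details taken from the Fine-R1 paper.

```python
# Hedged sketch: a generic triplet margin loss of the kind a
# "triplet enhancement strategy" could optimize. An anchor is pulled
# toward a positive (same subclass) and pushed away from a negative
# (different subclass) by at least `margin`.

def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero when the positive is closer than the negative by at
    least the margin; otherwise the size of the violation."""
    d_pos = l2_distance(anchor, positive)   # intra-class distance
    d_neg = l2_distance(anchor, negative)   # inter-class distance
    return max(d_pos - d_neg + margin, 0.0)

# A satisfied triplet (positive far closer than negative) costs 0:
anchor   = [1.0, 0.0]
positive = [1.1, 0.0]   # same subclass, small intra-class variation
negative = [0.0, 3.0]   # different subclass
print(triplet_loss(anchor, positive, negative))  # 0.0
```

Minimizing this loss over many (anchor, positive, negative) triplets is one standard way to make embeddings robust to within-class variation while keeping visually similar subclasses separable, which matches the robustness goals the articles attribute to Fine-R1's second training phase.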