Core Insights
- The research team led by Professor Peng Yuxin at Peking University has made significant advances in fine-grained visual recognition with multi-modal large models; their latest paper was accepted at ICLR 2026 and has been open-sourced [1][19].

Group 1: Fine-Grained Visual Recognition
- The real world is fine-grained: objects often form rich category hierarchies. Aircraft, for example, subdivide into specific models such as the Boeing 707, 717, and 727, with over 500 types of fixed-wing aircraft recorded globally [2].
- The Fine-R1 model aims to leverage the extensive knowledge of fine-grained subcategories embedded in multi-modal large models to recognize visual objects in open domains, overcoming the limitation of traditional methods that operate only on a closed set of categories [4].

Group 2: Model Development and Methodology
- Fine-R1 employs a two-phase approach:
  1. Chain-of-thought supervised fine-tuning, which emulates human reasoning to strengthen the model's inference capabilities [7].
  2. Triplet-enhancement strategy optimization, which improves the model's robustness to intra-class variation and its ability to discriminate between classes [8].
- The model achieves higher accuracy on both seen and unseen subcategories with only four training images per class, surpassing models such as OpenAI's CLIP and Google DeepMind's SigLIP [13][14].

Group 3: Experimental Results
- Experiments show that Fine-R1 outperforms a range of models on both closed-set and open-set recognition tasks, demonstrating its effectiveness in fine-grained visual recognition [14][16].
- The gains are attributed primarily to the model's improved ability to exploit fine-grained subcategory knowledge, rather than to better visual representations alone or a larger knowledge store [16].
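The triplet-enhancement idea described in Group 2 can be illustrated with a standard triplet margin loss: an anchor image is pulled toward a positive example from the same subcategory (tolerating intra-class variation) and pushed away from a negative example from a different subcategory. This is a minimal sketch of the general technique; the article does not specify Fine-R1's actual objective, embedding model, or margin value.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding vectors.

    Small when the anchor is closer to the positive (same subcategory)
    than to the negative (different subcategory) by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive)  # intra-class distance
    d_neg = np.linalg.norm(anchor - negative)  # inter-class distance
    return max(0.0, d_pos - d_neg + margin)

# Toy unit-normalized embeddings: two views of one subcategory
# (anchor, positive) and one view of a different subcategory (negative).
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
positive = positive / np.linalg.norm(positive)
negative = np.array([0.0, 1.0])

loss = triplet_margin_loss(anchor, positive, negative)
```

With these toy embeddings the anchor already sits much closer to the positive than to the negative, so the hinge is inactive and the loss is zero; swapping positive and negative produces a large positive loss, which is the gradient signal that would sharpen class boundaries during training.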
Surpassing CLIP: Peking University open-sources a fine-grained visual recognition large model needing only 4 training images per class
36Kr · 2026-02-11 08:03