Beyond CLIP! Peking University Open-Sources a Fine-Grained Visual Recognition Large Model That Needs Only 4 Training Images per Class

Core Viewpoint - The article discusses the limitations of current multimodal large models in fine-grained visual recognition and introduces Fine-R1, a model developed by Professor Peng Yuxin's team at Peking University that significantly improves recognition accuracy with minimal training data [1][2][5].

Group 1: Fine-Grained Visual Recognition Challenges
- Current multimodal large models excel at complex tasks but lag behind their own visual encoders, such as CLIP, in fine-grained visual recognition [1].
- Real-world objects are fine-grained by nature and fall into numerous subclasses (for example, there are over 500 types of fixed-wing aircraft), which makes fine-grained recognition important in practical applications [3].

Group 2: Fine-R1 Model Overview
- Fine-R1 aims to combine the rich knowledge of fine-grained subclasses held by multimodal large models with a generative decoding paradigm to overcome the limitations of traditional recognition methods, enabling fine-grained recognition of any visual object in an open domain [5].
- Fine-R1 strengthens the model's ability to reason about unseen subclasses using only 4 training images per subclass, outperforming models such as OpenAI's CLIP and Google DeepMind's SigLIP [5][15].

Group 3: Model Development Process
- The development of Fine-R1 involves two main steps:
  1. Chain-of-thought supervised fine-tuning, which simulates human reasoning to build inference capabilities [7].
  2. Triplet enhancement strategy optimization, which uses positive and negative samples to improve robustness to intra-class variation and sensitivity to inter-class differences [8][10].

Group 4: Experimental Results
- Fine-R1 was evaluated on six authoritative fine-grained image classification datasets, where it achieved higher accuracy than other models on both seen and unseen categories [15][17].
- The primary factor behind the improved accuracy was identified as the model's ability to use its fine-grained subclass knowledge effectively, rather than enhancements in visual representation or knowledge storage [19].

Group 5: Conclusion and Future Work
- The article concludes that Fine-R1 has strong potential in fine-grained visual recognition tasks, emphasizing its innovative approach to reasoning and knowledge application [21].
- The research has been accepted to ICLR 2026, and the code is open-sourced for further exploration [2][22].
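
To make the role of positive and negative samples in the triplet enhancement step of Group 3 more concrete, here is a minimal Python sketch of how (anchor, positive, negative) triplets could be drawn from a 4-image-per-subclass training set. The article does not describe how Fine-R1 actually samples or uses its triplets, so this is a generic illustration, not the authors' method. The function name `build_triplets`, the subclass names, and the file names are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

# (anchor, positive, negative) image identifiers
Triplet = Tuple[str, str, str]


def build_triplets(images_by_class: Dict[str, List[str]], seed: int = 0) -> List[Triplet]:
    """Pair every image with a same-subclass positive and an other-subclass negative.

    Generic illustration only; the sampling rule is an assumption,
    not taken from the Fine-R1 paper or code.
    """
    rng = random.Random(seed)
    classes = sorted(images_by_class)
    if len(classes) < 2:
        raise ValueError("need at least two subclasses to draw negatives")

    triplets: List[Triplet] = []
    for cls in classes:
        images = images_by_class[cls]
        if len(images) < 2:
            raise ValueError(f"subclass {cls!r} needs at least two images")
        for anchor in images:
            # Positive: a different image of the same subclass (intra-class variation).
            positive = rng.choice([img for img in images if img != anchor])
            # Negative: any image of a different subclass (inter-class difference).
            negative_cls = rng.choice([c for c in classes if c != cls])
            negative = rng.choice(images_by_class[negative_cls])
            triplets.append((anchor, positive, negative))
    return triplets


# Hypothetical 4-shot training set: 3 subclasses, 4 images each.
few_shot = {
    "Boeing 737": [f"b737_{i}.jpg" for i in range(4)],
    "Airbus A320": [f"a320_{i}.jpg" for i in range(4)],
    "Cessna 172": [f"c172_{i}.jpg" for i in range(4)],
}
triplets = build_triplets(few_shot)
print(len(triplets))  # 12: one triplet per training image
```

With only 4 images per subclass, each anchor has just 3 candidate positives, so in this sketch most of the variety comes from the choice of negative subclass.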