A Small but Mighty 3B Image Captioning Model Arrives, with Performance on Par with Qwen2.5-VL-72B
机器之心 (Synced) · 2025-10-28 04:31
Core Insights
- The article introduces CapRL (Captioning Reinforcement Learning), a new technique for dense image captioning. It applies reinforcement learning to captioning by redefining the reward in terms of how useful a caption is in practice [2][6][10]
- The CapRL-3B model achieves captioning performance comparable to Qwen2.5-VL-72B, a significant advance for image captioning that also offers insights for applying GRPO strategies to open-ended tasks [2][12]

Summary by Sections

Introduction to CapRL
- CapRL addresses the difficulty of designing rewards for a subjective task such as image description by defining objective, verifiable rewards based on a caption's usefulness [6][10]
- The trained model generates high-quality captions that improve on previous methods while avoiding problems such as reward hacking [8][10]

Limitations of Existing Methods
- Most current image captioning models rely on supervised fine-tuning (SFT), which depends on large, manually annotated datasets and is therefore costly and generalizes poorly [7][8]
- Because image descriptions are subjective, reliable reward functions are hard to design, which can cause problems during model training [7][8]

CapRL Framework
- CapRL uses a decoupled two-stage process: the model generates a caption, then a language model answers visual questions based on that caption. The accuracy of those answers serves as an objective reward signal [10][13]
- This approach significantly improves the quality of generated captions, raising accuracy and detail coverage while reducing hallucinations [10][11]

Experimental Results
- In experiments with the CapRL-5M dataset, CapRL shows significant improvements across 12 benchmarks compared with previous approaches such as ShareGPT4V and DenseFusion [12][14]
- In direct assessments of caption quality, CapRL-3B performs comparably to much larger models, with an average improvement of 8.4% over baseline models [12][15]

Conclusion and Future Work
- The CapRL framework has been open-sourced and continues to be iterated on, and the community is invited to use and explore it [12][19]
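
The reward idea summarized under CapRL Framework can be illustrated with a small sketch. It is not the CapRL implementation: `Question`, `Answerer`, and `keyword_answerer` are hypothetical names introduced here, and the keyword matcher is only a toy stand-in for the language model that answers questions from the caption. The group normalization shown is the standard GRPO-style advantage; the summary does not state how CapRL configures it.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Sequence


@dataclass(frozen=True)
class Question:
    """A multiple-choice question about one image (hypothetical structure)."""
    text: str
    options: tuple[str, ...]
    answer: str  # the correct option


# Hypothetical interface: a text-only answerer that receives the caption and
# the question, never the image, and returns one of the options.
Answerer = Callable[[str, Question], str]


def caption_reward(caption: str, questions: Sequence[Question],
                   answerer: Answerer) -> float:
    """Fraction of questions answered correctly from the caption."""
    if not questions:
        return 0.0
    correct = sum(answerer(caption, q) == q.answer for q in questions)
    return correct / len(questions)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> list[float]:
    """GRPO-style step: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keyword_answerer(caption: str, question: Question) -> str:
    """Toy stand-in for the language model (assumption, for illustration):
    pick the first option mentioned in the caption, else the first option."""
    lowered = caption.lower()
    for option in question.options:
        if option.lower() in lowered:
            return option
    return question.options[0]


questions = [
    Question("What animal is shown?", ("dog", "cat"), "cat"),
    Question("What color is the sofa?", ("red", "blue"), "blue"),
]
# Three candidate captions sampled for the same image.
captions = [
    "A cat sleeps on a blue sofa.",
    "A cat sits indoors.",
    "A room with furniture.",
]
rewards = [caption_reward(c, questions, keyword_answerer) for c in captions]
# rewards == [1.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

The more detail a caption carries, the more questions can be answered from it, so the detailed caption receives the highest reward and a positive advantage, while the vague one is pushed down. Because the reward is an accuracy over questions with known answers, it is verifiable in a way that a subjective quality score is not.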