ICCV25 Highlight | DeepGlint (格灵深瞳) RICE model sweeps the benchmarks, letting AI "see and understand" every detail of an image
机器之心· 2025-10-29 07:23
Core Viewpoint
- The article highlights the strong performance of the RICE model (MVT v1.5) developed by DeepGlint's research team, which excels across a wide range of visual tasks and was selected as a Highlight at ICCV 2025 [2][27].

Summary by Sections

MVT Series Overview
- The MVT series focuses on enhancing visual semantic representation using large-scale datasets, drawing on DeepGlint's expertise in facial recognition algorithms [5][7].
- MVT v1.0 used CLIP-style pre-training to extract features from vast image-text datasets, achieving state-of-the-art (SOTA) results on image classification and retrieval tasks [5][7].

RICE Model Development
- RICE (MVT v1.5) builds on previous models by modeling how image semantics are composed, recognizing that an image typically consists of multiple loosely related visual elements [9][27].
- The model treats text (character) blocks as semantic information and uses SAM to extract object regions from a dataset of 400 million images, yielding 2 billion region-level objects clustered into one million semantic categories [9][11].

Training Methodology
- During training, each image contributes roughly 10 region-level objects, and a Region Attention Layer is used to accelerate training [11][13].
- The architecture follows the classic ViT structure, and the semantic representation of its internal visual features strengthens as training progresses [13][27]; a hedged sketch of this region-level mechanism appears at the end of this summary.

Experimental Validation
- RICE has been validated extensively on downstream tasks, showing superior detection performance on datasets such as COCO and LVIS, as well as Roboflow100 [17][20].
- On multi-modal segmentation tasks built on the LLaVA series framework, RICE delivers significant improvements [18][23].

Multi-Modal Applications
- RICE serves as the visual encoder in the LLaVA-OneVision-1.5 model, achieving competitive results against other leading models across various benchmarks [25][27].
- The model's strength on optical character recognition (OCR) tasks gives it a notable advantage in multi-modal applications [23][27].

Future Directions
- The MVT series will advance to version 2.0, focusing on video encoding, which the team views as a critical step toward artificial general intelligence (AGI) [27].
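The region-level training recipe described above (SAM-derived region masks, attention pooling of ViT patch features into per-region tokens, and cluster discrimination over a very large semantic vocabulary) can be sketched roughly as follows. This is an illustrative approximation, not DeepGlint's released RICE code: the class names `RegionAttentionPooling` and `RegionClusterHead`, the masked cross-attention pooling, the tensor shapes, and the small demo cluster count are all assumptions introduced for clarity.

```python
# Minimal sketch of region-level attention pooling over ViT patch features,
# followed by classification of each region into a semantic cluster.
# NOT the official RICE implementation; shapes and layer design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttentionPooling(nn.Module):
    """Pools ViT patch tokens into one embedding per region.

    Each region is described by a binary mask over the patch grid (e.g. derived
    from SAM); patches outside the region are excluded from attention.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # shared learnable query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor, region_masks: torch.Tensor) -> torch.Tensor:
        # patch_tokens: [B, N, D]  ViT patch features
        # region_masks: [B, R, N]  1 where a patch belongs to the region, else 0
        B, N, D = patch_tokens.shape
        R = region_masks.shape[1]
        # Flatten (batch, region) pairs so each region attends independently.
        tokens = patch_tokens.unsqueeze(1).expand(B, R, N, D).reshape(B * R, N, D)
        key_padding_mask = region_masks.reshape(B * R, N) == 0  # True = ignore this patch
        query = self.query.expand(B * R, 1, D)
        pooled, _ = self.attn(query, tokens, tokens, key_padding_mask=key_padding_mask)
        return self.norm(pooled.reshape(B, R, D))  # one embedding per region

class RegionClusterHead(nn.Module):
    """Classifies each region embedding into one of `num_clusters` semantic categories."""
    def __init__(self, dim: int, num_clusters: int):
        super().__init__()
        self.proj = nn.Linear(dim, num_clusters, bias=False)

    def forward(self, region_embeds: torch.Tensor, cluster_labels: torch.Tensor) -> torch.Tensor:
        logits = self.proj(F.normalize(region_embeds, dim=-1))
        return F.cross_entropy(logits.flatten(0, 1), cluster_labels.flatten())

if __name__ == "__main__":
    # Toy sizes; the article describes ~10 regions per image and one million clusters.
    B, N, D, R, C = 2, 196, 768, 10, 1000  # C kept small here for the demo
    pool, head = RegionAttentionPooling(D), RegionClusterHead(D, C)
    patches = torch.randn(B, N, D)
    masks = (torch.rand(B, R, N) > 0.7).float()
    masks[..., 0] = 1.0  # ensure every region keeps at least one visible patch
    loss = head(pool(patches, masks), torch.randint(0, C, (B, R)))
    print(loss.item())
```

One point worth noting about this kind of design: because only about 10 region tokens are pooled per image, the per-region cross-attention adds little cost on top of the standard ViT forward pass, which is consistent with the article's claim that the Region Attention Layer is used to keep training fast.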