Dual SOTA on Segmentation and Understanding with Two Simple Modules! Xiang Bai's Team at Huazhong University of Science and Technology and Collaborators Introduce a New Multimodal Framework
量子位·2025-10-03 04:19

Core Insights
- The article discusses the evolution of multimodal large models from text-to-image generation toward pixel-level tasks such as image segmentation, highlighting two challenges: imprecise segmentation results and hallucinations during understanding [1][2].

Group 1: Model Development
- Research teams from Huazhong University of Science and Technology and Kingsoft Office proposed two core modules, the Semantic Enhanced Feature Extractor (SEFE) and Interleaved Local Visual Coupling (ILVC), to address segmentation accuracy and hallucination issues [3][24].
- SEFE enhances object-attribute reasoning by integrating semantic features with pixel-level features, yielding more precise segmentation results [4][25] (see the first sketch after this summary).
- ILVC provides fine-grained supervision by generating local descriptions conditioned on segmentation masks, effectively reducing hallucinations [5][26] (see the second sketch after this summary).

Group 2: Model Performance
- The resulting multimodal large model, LIRA, achieves state-of-the-art (SOTA) performance on both segmentation and understanding tasks [6].
- Compared to InternVL2, LIRA maintains understanding performance while additionally supporting image segmentation; it shows an average improvement of 8.5% on segmentation tasks over OMG-LLaVA and a 33.2% gain on MMBench [7].

Group 3: Experimental Results
- LIRA performs strongly across multiple understanding and segmentation datasets, with only a slight performance drop of 0.2% when jointly trained on both understanding and segmentation data [40].
- Integrating SEFE and ILVC reduces hallucination rates by 3.0% and 4.8% for the 1.8B and 7B models, respectively [38].

Group 4: Future Directions
- The article suggests that future research explore the relationship between text and visual tokens, which may offer new insights for improving both the understanding and segmentation capabilities of multimodal large models [43].
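As a rough illustration of the SEFE idea described above, the sketch below fuses dense pixel-level features with high-level semantic tokens via cross-attention. The class name, feature dimensions, and the specific fusion strategy (cross-attention followed by an MLP) are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of semantic-pixel feature fusion in the spirit of SEFE.
# Dimensions and the fusion design are illustrative assumptions.
import torch
import torch.nn as nn

class SemanticEnhancedFeatureExtractor(nn.Module):
    def __init__(self, pixel_dim=256, semantic_dim=1024, num_heads=8):
        super().__init__()
        # Project semantic tokens (e.g., from the multimodal LLM's vision branch)
        # into the pixel-feature space used by the segmentation decoder.
        self.sem_proj = nn.Linear(semantic_dim, pixel_dim)
        # Cross-attention: pixel features attend to semantic tokens.
        self.cross_attn = nn.MultiheadAttention(pixel_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(pixel_dim)
        self.mlp = nn.Sequential(
            nn.Linear(pixel_dim, 4 * pixel_dim),
            nn.GELU(),
            nn.Linear(4 * pixel_dim, pixel_dim),
        )

    def forward(self, pixel_feats, semantic_feats):
        # pixel_feats:    (B, N_pixels, pixel_dim)    spatially dense, low-level
        # semantic_feats: (B, N_tokens, semantic_dim) attribute-aware, high-level
        sem = self.sem_proj(semantic_feats)
        attended, _ = self.cross_attn(query=pixel_feats, key=sem, value=sem)
        fused = self.norm(pixel_feats + attended)
        return fused + self.mlp(fused)


if __name__ == "__main__":
    sefe = SemanticEnhancedFeatureExtractor()
    pixel = torch.randn(2, 1024, 256)     # e.g., a 32x32 feature map, flattened
    semantic = torch.randn(2, 576, 1024)  # e.g., ViT patch tokens
    print(sefe(pixel, semantic).shape)    # torch.Size([2, 1024, 256])
```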
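ILVC's fine-grained supervision pairs each segmentation mask with a local description. One plausible ingredient of such a scheme is mask-guided pooling of visual features into a per-region embedding that conditions local caption generation; the second sketch shows only that pooling step, with the function name, shapes, and pooling operator assumed for illustration rather than taken from the paper.

```python
# Minimal sketch of mask-guided local feature pooling, one possible building block
# of an ILVC-style local-description scheme. Shapes and the pooling operator are
# illustrative assumptions.
import torch

def masked_average_pool(feature_map, masks, eps=1e-6):
    """Pool dense visual features inside each segmentation mask.

    feature_map: (B, C, H, W) dense visual features
    masks:       (B, M, H, W) soft or binary masks, one per referred object
    returns:     (B, M, C) one local visual embedding per mask
    """
    masks = masks.to(feature_map.dtype)
    # Weighted sum of features under each mask, normalized by mask area.
    pooled = torch.einsum("bchw,bmhw->bmc", feature_map, masks)
    area = masks.sum(dim=(-2, -1)).clamp_min(eps).unsqueeze(-1)
    return pooled / area


if __name__ == "__main__":
    feats = torch.randn(2, 256, 64, 64)
    masks = (torch.rand(2, 3, 64, 64) > 0.5).float()
    local_embeds = masked_average_pool(feats, masks)
    print(local_embeds.shape)  # torch.Size([2, 3, 256])
    # In an ILVC-style setup, each local embedding would condition the language
    # model to generate a short description of that region, supervised against
    # a ground-truth local caption to provide the fine-grained signal.
```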