Fully Unleashing Modality Collaboration: MokA, a New Fine-tuning Paradigm Tailor-made for MLLMs
机器之心· 2025-06-29 02:21
Core Viewpoint
- The article discusses the limitations of current multimodal large language model (MLLM) fine-tuning methods, which often replicate strategies from unimodal language models without accounting for the distinct characteristics of multimodal learning [2][9][23].

Summary by Sections

Introduction to MLLMs
- MLLMs have recently made significant progress on visual-language and audio-language tasks [2].
- Current fine-tuning methods primarily adapt strategies from unimodal language models, such as LoRA, which may not be suitable for multimodal contexts [2][8].

Limitations of Current Fine-Tuning Methods
- Many efficient multimodal fine-tuning methods overlook the essential differences between modalities, leading to inadequate use of multimodal information [9][11].
- The article emphasizes that effective multimodal fine-tuning requires both unimodal adaptation and cross-modal adaptation [9][12].

Introduction of the MokA Method
- The research team proposes MokA (Multimodal low-rank Adaptation), which balances independent modeling of unimodal information with modeling of cross-modal interactions [3][12][23].
- MokA retains the efficiency of LoRA while redefining the roles of the projection matrices in a multimodal context [14][23].

Key Components of MokA
- MokA includes three critical modules (a minimal code sketch follows this summary):
1. **Modality-specific A matrices**: ensure independent modeling of each modality's information [15].
2. **Cross-modal attention mechanism**: strengthens interaction between modalities during instruction tuning [16].
3. **Shared B matrix**: enables implicit cross-modal alignment by projecting all modalities into a shared space [17].

Experimental Results
- MokA was evaluated in three representative multimodal task settings: audio-visual-text, visual-text, and speech-text [19].
- The method delivered significant performance improvements on multiple benchmark datasets, demonstrating its adaptability and effectiveness [19][23].

Conclusion
- MokA addresses the neglect of modality differences in current fine-tuning paradigms and offers a new direction for MLLM fine-tuning [23].
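
To make the three modules above concrete, the sketch below wraps a single frozen linear projection with a MokA-style adapter in PyTorch. It is a minimal illustration based only on this summary: the class name `MokALinear`, the rank and scaling values, the choice to let text tokens attend to the concatenated non-text tokens, and the placement of the cross-modal attention inside the low-rank space are all assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MokALinear(nn.Module):
    """Illustrative MokA-style adapter around a frozen linear projection.

    Hypothetical sketch: one low-rank A matrix per modality, a lightweight
    cross-modal attention step in the rank-r space, and a single shared B
    matrix, as described in the summary above. Exact shapes, placement, and
    hyperparameters are assumptions.
    """

    def __init__(self, base_linear: nn.Linear, modalities, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():          # pretrained weight stays frozen, LoRA-style
            p.requires_grad = False

        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.scaling = alpha / rank

        # (1) Modality-specific A matrices: one low-rank down-projection per modality.
        self.A = nn.ModuleDict({m: nn.Linear(d_in, rank, bias=False) for m in modalities})

        # (3) Shared B matrix: a single up-projection used by every modality.
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)             # adapter starts as a no-op

        # (2) Cross-modal attention in the low-rank space (placement is an assumption):
        # text tokens query the concatenated non-text tokens.
        self.cross_attn = nn.MultiheadAttention(rank, num_heads=1, batch_first=True)

    def forward(self, tokens: dict) -> dict:
        """`tokens` maps a modality name to hidden states of shape (batch, seq_len, d_in)."""
        low = {m: self.A[m](x) for m, x in tokens.items()}    # independent unimodal modeling

        # Let text tokens gather information from the other modalities.
        if "text" in low and len(low) > 1:
            context = torch.cat([h for m, h in low.items() if m != "text"], dim=1)
            attended, _ = self.cross_attn(low["text"], context, context)
            low["text"] = low["text"] + attended

        # Shared B projects every modality back; the result is added to the frozen base output.
        return {m: self.base(tokens[m]) + self.scaling * self.B(low[m]) for m in tokens}


# Hypothetical usage: adapt one 768-dim projection for audio, visual, and text streams.
adapter = MokALinear(nn.Linear(768, 768), modalities=["audio", "visual", "text"])
outputs = adapter({m: torch.randn(2, 16, 768) for m in ["audio", "visual", "text"]})
```

In this sketch only the per-modality A matrices, the shared B matrix, and the small attention head are trainable, so the parameter budget stays close to a standard LoRA of the same rank, consistent with the summary's claim that MokA retains LoRA's efficiency.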