Continual Multimodal Instruction Tuning
ICML 2025 | Giving AI an "Intelligent Upgrade Plug-in": Alibaba Security and Tsinghua University's D-MoLE Lets Models Evolve Dynamically During Continual Learning
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article introduces D-MoLE (Dynamic Mixture of Curriculum LoRA Experts), a framework designed to let Multimodal Large Language Models (MLLMs) continually adapt to evolving task requirements while preserving previously acquired knowledge [4][12][13].

Research Background
- MLLMs combine modalities such as vision and text, showing strong capabilities in handling multimodal information [3].
- A significant obstacle in practice is catastrophic forgetting: when a model is fine-tuned for a new task, it tends to lose knowledge acquired on earlier tasks [4].

Key Challenges
- Continual multimodal instruction tuning (CMIT) is needed so that MLLMs can adapt to a stream of new tasks while retaining past knowledge [4][12].
- Two main challenges are task architecture conflicts and modality imbalance: different tasks depend on different model layers and rely on the available modalities to different degrees [4][7].

Proposed Solution
- D-MoLE dynamically adjusts the model architecture to each task, introducing additional parameter modules (LoRA experts) only where they are needed [10][13].
- It also incorporates a gradient-based continual curriculum strategy that balances updates across modalities, ensuring more equitable optimization [10][12].

Methodology
- D-MoLE consists of two core components: a dynamic layer-wise expert allocator and a gradient-based inter-modal continual curriculum mechanism [16][22].
- The allocator identifies the layers most critical for adapting to a new task and assigns LoRA experts to them, while the curriculum mechanism adjusts the update ratio between the language model and the modality encoders according to task difficulty (illustrative sketches of both mechanisms follow this summary) [22][24].

Experimental Results
- D-MoLE was evaluated on a benchmark of nine datasets spanning visual question answering, image captioning, and visual grounding [27].
- The framework significantly outperformed baseline methods, improving average performance by approximately 15.08% and raising backward transfer (BWT) from -21.31% to -1.49%, i.e., nearly eliminating forgetting (the BWT metric is defined in the helper after this summary) [29].

General Capability Assessment
- D-MoLE preserved strong general multimodal capabilities, outperforming traditional methods across multiple evaluation benchmarks [30][31].

Training Efficiency
- Despite its additional mechanisms, D-MoLE's total training time was comparable to that of traditional methods, since only a small subset of parameters is updated for each task [36].

Business Application
- D-MoLE can strengthen Alibaba's multimodal security-audit models, enabling rapid adaptation to different platform rules without extensive retraining, thereby reducing operational costs and improving flexibility [38][39].
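To make the dynamic layer-wise expert allocator concrete, here is a minimal sketch of one plausible reading of the idea: before training on a new task, probe each linear layer's sensitivity to that task (here via gradient norms on a few batches, a simplification standing in for the paper's actual layer-importance measure) and attach LoRA experts only to the top-ranked layers. All names (`LoRAExpert`, `probe_layer_sensitivity`, the `budget` of 4) are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """Low-rank adapter: base(x) + (alpha/r) * x @ A^T @ B^T, base frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: adapter starts as identity
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

def probe_layer_sensitivity(model, loss_fn, probe_batches):
    """Rank linear layers by accumulated gradient norm on a few batches of
    the NEW task (a stand-in for the paper's importance measure)."""
    scores = {}
    for batch in probe_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                g = sum(p.grad.norm().item() for p in module.parameters()
                        if p.grad is not None)
                scores[name] = scores.get(name, 0.0) + g
    return sorted(scores, key=scores.get, reverse=True)

def allocate_experts(model, ranked_layers, budget: int = 4):
    """Attach a fresh LoRA expert only to the `budget` most sensitive layers."""
    for name in ranked_layers[:budget]:
        parent_name, _, child = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child, LoRAExpert(getattr(parent, child)))
```

Because each task only receives a small budget of new experts on its most sensitive layers, the per-task parameter growth stays bounded, which is consistent with the article's claim that training time remains comparable to conventional fine-tuning.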
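The gradient-based inter-modal curriculum can be sketched in the same spirit: estimate how hard the new task is for each branch (again using gradient norms as a proxy for the paper's difficulty measure) and tilt the update budget toward the branch that needs it more. The branch names, the `vision_encoder` prefix, and the minimum-share `floor` are assumptions for illustration only.

```python
def modality_difficulty(model, loss_fn, probe_batches):
    """Per-branch gradient-norm totals for the LLM vs. the modality
    encoder -- a rough proxy for how hard the task is per modality."""
    diff = {"llm": 0.0, "vision": 0.0}
    for batch in probe_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            branch = "vision" if name.startswith("vision_encoder") else "llm"
            diff[branch] += p.grad.norm().item()
    return diff

def update_ratio(diff, floor: float = 0.2):
    """Turn difficulty scores into each branch's share of update steps,
    keeping a minimum share for both so neither is starved."""
    total = diff["llm"] + diff["vision"]
    r_llm = max(floor, min(1.0 - floor, diff["llm"] / total))
    return {"llm": r_llm, "vision": 1.0 - r_llm}
```

In a training loop, this ratio would decide, per task, how many optimizer steps (or what learning-rate scale) the language model's experts receive relative to the modality encoder's, which is one way to realize the "more equitable optimization" the article describes.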
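For the reported BWT numbers, the standard continual-learning definition (which the paper presumably follows) averages how much each earlier task's score changes after all tasks have been trained; values near zero mean little forgetting. A small helper makes the arithmetic explicit:

```python
def backward_transfer(R):
    """BWT = (1 / (T-1)) * sum_i (R[T-1][i] - R[i][i]) for i < T-1,
    where R[t][i] is the score on task i right after training task t
    and R is a T x T list of lists. Negative BWT means forgetting."""
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)

# Under this definition, moving BWT from -21.31% to -1.49% means earlier
# tasks lose ~1.5 points on average instead of ~21 after the full sequence.
```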