Mixture-of-Experts Gradient Alignment
Fine-tune large models on a single GPU! Memory footprint just 1/8, performance still at full strength | ICML 2025
量子位· 2025-05-28 02:23
Core Insights
- The article covers recent advances in low-rank adaptation (LoRA) for fine-tuning large pre-trained models, centering on a new framework called GOAT that improves performance while preserving LoRA's efficiency [2][3][18].

Group 1: LoRA and Its Challenges
- Large foundation models such as Qwen, GPT, and DeepSeek R1 are central to modern deep learning, but their enormous parameter counts make full fine-tuning expensive [1].
- Traditional LoRA sharply reduces the number of trainable parameters (typically only 0.1%-5% of the model is updated; see the first sketch after this summary), but it often underperforms full fine-tuning [6].
- Existing attempts to close the gap, such as random initialization or static singular value decomposition (SVD) initialization, fail to fully exploit the knowledge already embedded in the pre-trained weights [6][12].

Group 2: GOAT Framework
- GOAT introduces adaptive singular-value initialization and a mixture-of-experts gradient-alignment strategy to address LoRA's performance limitations; the second and third sketches after this summary illustrate the structures these ideas apply to [3][18].
- GOAT has been validated on 25 multi-domain tasks, matching or exceeding full-parameter fine-tuning while updating only a small fraction of the parameters [3][18].
- The framework also cuts memory requirements substantially: training LLaMA-7B takes about 35GB, versus roughly 640GB for full-parameter fine-tuning of an equivalent MoE model [18].

Group 3: Experimental Results
- On natural language generation tasks, GOAT outperformed mainstream LoRA-MoE variants by 4.2% on MT-Bench, 6.3% on GSM8K, and 3.1% on HumanEval, approaching full fine-tuning [18].
- On image classification, GOAT reached 99% of full-parameter fine-tuning performance while using only 2.24% of the parameters, beating other LoRA variants by 6% [18].
- On commonsense reasoning, average accuracy reached 82.73%, exceeding ChatGPT by 7.42%, indicating strong knowledge transfer [18].
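For context on why LoRA keeps the trainable fraction so small, here is a minimal sketch of a LoRA-adapted linear layer. It is not code from the GOAT paper; the class name, rank, and scaling convention are illustrative assumptions.

```python
# Minimal sketch of a LoRA-adapted linear layer (PyTorch; names and defaults are
# illustrative assumptions, not the GOAT implementation).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pretrained weight and bias
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # trainable down-projection
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # trainable up-projection (zero-init)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus the low-rank update: y = base(x) + scale * x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Why the trainable fraction is tiny: for d_in = d_out = 4096 and rank = 8,
# rank * (d_in + d_out) = 65,536 trainable values versus 16,777,216 in the frozen
# weight, i.e. roughly 0.4% of that layer's parameters.
```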
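The contrast between static SVD initialization and an adaptive singular-value scheme can be made concrete with the sketch below. The exact selection rule GOAT uses is not described in this summary, so `segment_start` is a hypothetical stand-in: setting it to 0 reproduces the usual static top-r SVD initialization, while other values seed the adapter from a different part of the spectrum.

```python
# Hedged sketch of seeding LoRA factors from the SVD of a pretrained weight
# (PyTorch; `segment_start` is a hypothetical knob, not GOAT's actual rule).
import torch

def svd_lora_init(weight: torch.Tensor, rank: int, segment_start: int = 0):
    """Return low-rank factors (A, B) such that B @ A reconstructs the chosen
    segment of `weight`'s spectrum.

    weight:        frozen pretrained matrix of shape (d_out, d_in)
    rank:          LoRA rank
    segment_start: index of the first singular triplet to use; 0 gives the
                   static top-r SVD initialization mentioned above
    """
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    sl = slice(segment_start, segment_start + rank)
    sqrt_s = S[sl].sqrt()
    B = U[:, sl] * sqrt_s                  # (d_out, rank), columns scaled by sqrt(sigma)
    A = sqrt_s.unsqueeze(1) * Vh[sl, :]    # (rank, d_in), rows scaled by sqrt(sigma)
    return A, B

# Whether (and how) the frozen base weight is adjusted to compensate for the
# extracted segment is a design detail of the full method not covered in this summary.
```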
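Finally, a mixture-of-experts gradient-alignment strategy presupposes a layer that combines several low-rank experts under a router. The sketch below shows only that surrounding structure with simple dense gating; the gradient-alignment step itself is not detailed in the summary and is not implemented here.

```python
# Illustrative mixture-of-LoRA-experts layer with simple dense gating (PyTorch).
# GOAT's gradient-alignment strategy is NOT implemented here; this only shows the
# kind of multi-expert low-rank structure such a strategy would act on.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAMoELinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # pretrained path stays frozen
        d_out, d_in = base.weight.shape
        self.num_experts = num_experts
        self.A = nn.Parameter(torch.randn(num_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, d_out, rank))
        self.router = nn.Linear(d_in, num_experts)     # token-wise gating over experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, d_in)
        gates = F.softmax(self.router(x), dim=-1)           # (batch, num_experts)
        out = self.base(x)                                   # frozen pretrained output
        for e in range(self.num_experts):
            delta = x @ self.A[e].T @ self.B[e].T            # expert e's low-rank update
            out = out + gates[:, e:e + 1] * delta            # gate-weighted combination
        return out

# Practical LoRA-MoE variants usually route sparsely (top-k experts per token);
# dense gating is used here only to keep the sketch short and unambiguous.
```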