LoRA
Over 1,100 models take different paths to the same destination, pointing to a shared "universal subspace": has Plato won another round?
机器之心· 2025-12-14 04:53
The importance of model architecture may far exceed what we previously believed. A recent study from Johns Hopkins University found that the weights of more than 1,100 different neural networks, even when trained on completely different datasets with different initializations and hyperparameters, ultimately converge to a shared low-dimensional subspace. This seems to suggest that there exists an "a priori" mathematical structure that all neural networks are approximating: training does not "create" anything, but rather "discovers" a geometric form that was there all along. In other words, what neural networks "want to learn" appears to be highly consistent, and the architecture determines what they can learn, with more influence than the data. This finding helps explain many "mysterious" phenomena: why do over-parameterized models (with far more parameters than training samples) still generalize? Why do different initializations converge to similar representations? Why do techniques such as LoRA and weight sharing work at all? If neural networks really do learn within a shared subspace, this would provide a supporting explanation for implicit regularization, transferability, and the effectiveness of sparse training methods, while also opening the door to applications such as efficient model merging, new optimization techniques, and faster, more efficient learning and inference. The paper has drawn a great deal of attention on platforms such as Alphaxiv and X, at one point climbing to the top spot on Alphaxiv. Some say Plato has won another round. (Note: Plato's theory of Forms holds that the concrete things we see, such as tables, horses, and circles, are merely instances of the "Forms" ...
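To make the "shared subspace" claim concrete, here is a toy sketch, not the paper's method, of the kind of test one could run: collect the flattened weight vectors of many independently trained models, stack them, and check how much of their variance a handful of singular directions capture. The synthetic "models" below are stand-ins, and every name and number is an illustrative assumption.

```python
import numpy as np

# Toy stand-in: "train" many small models by solving random tasks
# that share a hidden low-dimensional structure, then collect their
# flattened weight vectors. (Illustrative only; the paper studies
# 1,100+ real networks.)
rng = np.random.default_rng(0)
n_models, dim, hidden_rank = 200, 512, 16

weights = []
basis = rng.standard_normal((dim, hidden_rank))  # shared structure
for _ in range(n_models):
    coeffs = rng.standard_normal(hidden_rank)
    w = basis @ coeffs + 0.01 * rng.standard_normal(dim)  # per-model noise
    weights.append(w)

W = np.stack(weights)                    # shape: (n_models, dim)
W -= W.mean(axis=0, keepdims=True)       # center before SVD

# A sharp drop in the singular value spectrum after k directions means
# the weight vectors concentrate in a k-dimensional subspace.
s = np.linalg.svd(W, compute_uv=False)
explained = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(explained, 0.95)) + 1
print(f"95% of variance captured by {k} of {len(s)} directions")
```

In this toy setup the spectrum collapses after roughly 16 directions because the data were built that way; the paper's claim is that real, independently trained networks exhibit a similarly sharp collapse.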
Today, it feels as though we witnessed the end of the SD era.
数字生命卡兹克· 2025-10-13 01:33
Core Viewpoint
- The article reflects on the evolution of the AI drawing community, particularly the transition from the early days of Stable Diffusion (SD) to the current state marked by the launch of Liblib 2.0, indicating a significant shift in the landscape of AI tools and user engagement [2][55]

Group 1: Historical Context
- The article reminisces about the peak of the SD open-source community, highlighting its rapid growth and the excitement it generated among users [11][31]
- It mentions the initial struggles and learning curves users faced in mastering the complex parameters and prompts needed to generate images [50][51]
- The community was characterized by a sense of exploration and innovation, with users actively engaging in discussions and sharing techniques [47][41]

Group 2: Transition to Liblib 2.0
- Liblib has announced an upgrade to version 2.0, introducing a new brand, logo, interface, and features aimed at simplifying the user experience and expanding its user base [3][67]
- The upgrade signals a shift toward a more integrated platform that combines various AI drawing and video models, aiming to lower the entry barrier for new users [60][65]
- The article suggests that this transition is a natural progression in the industry, akin to technological advancements that replace older methods [56][57]

Group 3: Community and User Engagement
- The article notes a decline in user engagement and interest in the original SD models as newer, simpler tools have emerged to serve a broader audience [9][54]
- Despite the changes, the community remains vibrant, with a focus on creativity and the enduring presence of talented creators [75][76]
- The narrative emphasizes that while tools may evolve or disappear, the essence of creativity and the community's spirit will persist [75][76]
The architect of ChatGPT has just released his latest research
量子位· 2025-09-30 12:22
Core Insights
- The article discusses the latest research from Thinking Machines on the efficient fine-tuning method LoRA, co-authored by John Schulman, a co-founder of OpenAI [1][3][27]

Group 1: Research Findings
- The research, titled "LoRA Without Regret", explores the conditions under which LoRA can match the efficiency of full fine-tuning (FullFT) and provides a simplified approach that reduces the difficulty of hyperparameter tuning [3][7]
- Current large models often have trillions of parameters and are trained on vast datasets, but downstream tasks typically require only small datasets focused on specific domains [6]
- LoRA, as a parameter-efficient fine-tuning method, captures fine-tuning information through low-rank matrices, and the research confirms that LoRA can achieve performance similar to FullFT by focusing on key details [7][12] (a minimal sketch of the parametrization follows this summary)

Group 2: Performance Comparisons
- The optimal learning rate for LoRA is found to be ten times that of FullFT, demonstrating its capability to compete effectively in fine-tuning scenarios with small-to-medium datasets [9][12]
- Experiments using Llama 3 and Qwen3 models on specific datasets showed that high-rank LoRA's learning curves closely track FullFT's, with both exhibiting logarithmic decreases in loss during training [10][11]
- In mathematical reasoning tasks, even at rank 1, LoRA's performance remains comparable to FullFT, highlighting its efficiency in absorbing information during training [13][14]

Group 3: Application Insights
- The research emphasizes that applying LoRA across all layers of a model, rather than only the attention layers, is crucial for maximizing its performance [15][19]
- Previous studies often limited LoRA's application to the attention matrices, but this research indicates that broader application leads to significant performance improvements [16][19]
- The findings suggest that gradient behavior is dominated by the layers with the most parameters, so full-layer coverage is necessary for LoRA to approach FullFT performance [21]

Group 4: Hyperparameter Tuning
- The research team proposes a simplified approach to reduce the complexity of tuning LoRA's hyperparameters, observing that the optimal learning rate consistently follows a specific pattern [22][25]
- Of four potential hyperparameters, two are deemed redundant, allowing users to focus on the "initial update scale" and the "steps of deviation from the initial state" to streamline tuning [25][26]
- This simplification effectively halves the tuning difficulty of LoRA, making it more accessible to users [26]
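As background for the findings above, here is a minimal, self-contained sketch of the standard LoRA parametrization the paper analyzes: a frozen base weight plus a trainable low-rank update B @ A, with B initialized to zero so fine-tuning starts exactly from the pretrained model. Class and variable names are illustrative, and the 10x learning-rate rule appears only as a comment reflecting the paper's empirical finding.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # full weights stay frozen

        d_out, d_in = base.weight.shape
        # A is small random, B is zero, so the initial update B @ A is zero
        # and fine-tuning starts from the unmodified pretrained model.
        self.A = nn.Parameter(torch.randn(rank, d_in) / d_in**0.5)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap a layer and train only the adapter parameters.
layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
# Per the paper's finding, LoRA's optimal learning rate is roughly
# 10x the one you would use for full fine-tuning of the same model.
opt = torch.optim.AdamW([layer.A, layer.B], lr=1e-4)
```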
Thinking Machines publishes another high-quality blog post: making the case that LoRA is no worse than full fine-tuning
机器之心· 2025-09-30 10:38
Core Insights
- The article emphasizes the advantages of LoRA (Low-Rank Adaptation) over full fine-tuning (FullFT) in terms of cost-effectiveness and performance across various training scenarios [2][7][18]

Group 1: Importance of LoRA
- LoRA is a popular parameter-efficient fine-tuning method that updates a low-dimensional adapter instead of the entire model's weights, leading to lower memory requirements and faster loading [11][13]
- The research indicates that LoRA can achieve performance comparable to FullFT on small to medium-sized datasets, while it may struggle on large datasets due to capacity limitations [14][22]

Group 2: Key Findings
- The study found that LoRA's performance is closely tied to the training conditions, including the size of the training dataset and the rank of the LoRA parameters [16][25]
- In reinforcement learning tasks, even with a very low rank (rank = 1), LoRA can perform similarly to FullFT, indicating that reinforcement learning has lower capacity demands [29]

Group 3: Experimental Methodology
- The research used models such as Llama 3 and Qwen3, varying LoRA ranks from 1 to 512 and sweeping learning rates to find optimal training conditions [20][21]
- Results showed that high-rank LoRA performed almost identically to FullFT on certain datasets, but performance varied across tasks due to differing training dynamics [22][24]

Group 4: Practical Implications
- LoRA's optimal learning rate is typically about 10 times that of FullFT, allowing it to tolerate higher learning rates under the same conditions [35]
- The study suggests that applying LoRA across all layers, especially the MLP and MoE layers, is crucial for achieving performance close to FullFT [37] (see the configuration sketch after this summary)
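To show what "LoRA on all layers" means in practice, here is a sketch using the Hugging Face PEFT library, under the assumption of a Llama-style architecture whose linear layers use the conventional q_proj/k_proj/.../down_proj names; the checkpoint name is a placeholder, and the exact module list varies by model family.

```python
# Assumes `transformers` and `peft` are installed, and a Llama-style
# checkpoint (the name below is a placeholder you may need to swap).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Attention-only LoRA (the common earlier practice) would list just the
# four attention projections. The blog's recommendation is to cover the
# MLP blocks too, since they hold the bulk of the parameters and hence
# dominate the gradient signal.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter weights train
```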
ICML 2025 | CoTo: letting LoRA training "hit its stride", excelling at model merging and pruning alike
机器之心· 2025-07-26 12:17
Core Viewpoint
- The article introduces CoTo, a progressive training strategy designed to enhance the robustness and effectiveness of Low-Rank Adaptation (LoRA) models, addressing issues such as training instability and performance drops after pruning [1][4][23]

Summary by Sections

Conventional LoRA Training Issues
- LoRA suffers from "lazy training", where optimization gets stuck near suboptimal solutions, limiting generalization [7]
- Training is hierarchically imbalanced: gradient updates concentrate on the top layers, leaving the lower layers undertrained [7]
- These issues complicate downstream operations such as model fusion and pruning, often yielding unsatisfactory results [7]

CoTo Strategy
- CoTo employs a simple yet effective progressive activation strategy, initially deactivating a portion of the LoRA adapters to encourage uniform gradient flow across all layers [5][8] (see the sketch after this summary)
- The activation probability of the adapters is gradually increased during training, returning to standard fine-tuning in the later stages [8]

Experimental Results
- CoTo significantly improves the fusion and pruning capabilities of LoRA models while enhancing single-task generalization and training efficiency [12][23]
- In linear-interpolation tasks, CoTo models maintain smooth performance transitions, unlike standard LoRA, which suffers sharp declines [13]
- CoTo outperforms standard LoRA in both structured and unstructured pruning scenarios, demonstrating greater fault tolerance [17]

Performance and Efficiency Improvements
- CoTo consistently boosts performance across benchmarks covering both visual and language tasks, and achieves over 24% training acceleration when applied to HiRA [24][23]

Ablation Studies
- Rigorous ablation studies validate CoTo's design choices and offer insights into effective regularization of LoRA [21]

Conclusion
- CoTo resolves the hierarchical imbalance and lazy-optimization issues in LoRA training, enhancing model robustness and simplifying downstream operations such as fusion and pruning [23]
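A minimal sketch of the progressive-activation idea described above, not the authors' code: each adapter's output is gated by a Bernoulli sample whose activation probability ramps from an initial value to 1 over training, after which the procedure reduces to ordinary LoRA fine-tuning. The linear schedule, per-adapter gating granularity, and the dropout-style 1/p rescaling are all illustrative assumptions.

```python
import torch

def coto_activation_prob(step: int, total_steps: int, p0: float = 0.5) -> float:
    """Linearly ramp the per-adapter activation probability from p0 to 1."""
    t = min(step / max(total_steps, 1), 1.0)
    return p0 + (1.0 - p0) * t

def gated_adapter_outputs(adapter_outputs, step, total_steps):
    """Stochastically silence whole adapters early in training.

    adapter_outputs: list of per-layer LoRA update tensors.
    Early on, many adapters are dropped, which spreads gradient
    signal toward the otherwise-undertrained lower layers; by the
    end, all adapters are active and training is standard LoRA.
    """
    p = coto_activation_prob(step, total_steps)
    gated = []
    for out in adapter_outputs:
        gate = torch.bernoulli(torch.tensor(p, device=out.device))
        # 1/p rescaling keeps the expected update magnitude constant
        # (a standard dropout-style choice, assumed here).
        gated.append(out * gate / p)
    return gated
```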
Fully unlocking modality collaboration: MokA, a new fine-tuning paradigm tailor-made for MLLMs
机器之心· 2025-06-29 02:21
Core Viewpoint
- The article discusses the limitations of current multimodal large language model (MLLM) fine-tuning methods, which often replicate strategies from unimodal language models without accounting for the unique characteristics of multimodal learning [2][9][23]

Summary by Sections

Introduction to MLLMs
- Recent advances in MLLMs have brought significant progress on visual-language and audio-language tasks [2]
- Current fine-tuning methods primarily adapt strategies from unimodal language models, such as LoRA, which may not suit multimodal settings [2][8]

Limitations of Current Fine-Tuning Methods
- Many efficient multimodal fine-tuning methods overlook the essential differences between modalities, leading to inadequate use of multimodal information [9][11]
- The article argues that effective multimodal fine-tuning requires both unimodal adaptation and cross-modal adaptation [9][12]

Introduction of the MokA Method
- The research team proposes MokA (Multimodal low-rank Adaptation), which balances independent modeling of unimodal information with interaction modeling between modalities [3][12][23]
- MokA retains the efficiency of LoRA while redefining the roles of the projection matrices in a multimodal context [14][23]

Key Components of MokA
- MokA includes three critical modules (a sketch follows this summary):
  1. Modality-specific A matrices: ensure independent modeling of unimodal information [15]
  2. Cross-modal attention mechanism: enhances interaction between modalities during instruction tuning [16]
  3. Shared B matrix: facilitates implicit cross-modal alignment by projecting all modalities into a shared space [17]

Experimental Results
- MokA was evaluated in three representative multimodal task settings: audio-visual-text, visual-text, and speech-text [19]
- The method demonstrated significant performance improvements on various benchmark datasets, showcasing its adaptability and effectiveness [19][23]

Conclusion
- MokA addresses the neglect of modality differences in current fine-tuning paradigms, offering a new direction for multimodal large model fine-tuning [23]
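To make the three modules concrete, here is a heavily hedged sketch, not the authors' implementation: each modality's tokens pass through their own low-rank A matrix, a cross-modal attention step lets text features attend to the other modalities' compressed features, and a single shared B projects everything back to the model dimension. The shapes, the text-queries-others attention arrangement, and all names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MokALayer(nn.Module):
    """Sketch of MokA-style multimodal low-rank adaptation (assumed design)."""

    def __init__(self, d_model: int, rank: int,
                 modalities=("audio", "visual", "text")):
        super().__init__()
        # Modality-specific A matrices: independent unimodal adaptation.
        self.A = nn.ModuleDict({
            m: nn.Linear(d_model, rank, bias=False) for m in modalities
        })
        # Cross-modal attention operating in the low-rank space.
        self.cross_attn = nn.MultiheadAttention(rank, num_heads=1,
                                                batch_first=True)
        # One B matrix shared across modalities: implicit alignment.
        self.B = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.B.weight)  # start as a no-op, like LoRA

    def forward(self, tokens: dict) -> dict:
        # tokens: {modality: (batch, seq, d_model)}; assumes "text" present.
        # 1) Unimodal adaptation through each modality's own A.
        z = {m: self.A[m](x) for m, x in tokens.items()}
        # 2) Cross-modal interaction: text queries attend to the
        #    concatenated non-text features (an assumed arrangement).
        context = torch.cat([z[m] for m in z if m != "text"], dim=1)
        attended, _ = self.cross_attn(z["text"], context, context)
        z["text"] = z["text"] + attended
        # 3) Shared B maps all modalities into a common update space.
        return {m: self.B(h) for m, h in z.items()}
```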
How much parameter redundancy does LoRA really have? New research: prune 95% and performance holds
机器之心· 2025-05-02 04:39
Core Viewpoint
- The article introduces LoRI, a technique demonstrating that the trainable parameters of LoRA can be reduced dramatically while maintaining strong model performance, achieving results comparable or superior to full fine-tuning and other methods with only 5% of LoRA's parameters [1][9]

Summary by Sections

LoRA and Its Limitations
- LoRA is widely adopted for parameter-efficient fine-tuning (PEFT) but still incurs significant memory overhead, especially for large models [3][4]
- Recent research indicates substantial redundancy in the incremental parameters, motivating LoRI, which reduces the number of trainable parameters while preserving model knowledge [4]

LoRI Methodology
- LoRI keeps the low-rank matrix A fixed as a random projection and trains matrix B under a task-specific sparse mask, allowing a large reduction in trainable parameters [4][13] (see the sketch after this summary)
- Even with 90% sparsity in B, LoRI maintains good performance, indicating that adaptation does not require updating A [4][17]

Multi-Task Learning and Adapter Merging
- Multi-task learning is essential for building versatile models, but training on mixed datasets is costly; LoRI allows existing models to be merged without retraining, effectively combining LoRA adapters for multi-task capability [7]
- Directly merging heterogeneous LoRA adapters can cause parameter interference, but LoRI mitigates this by mapping task-specific adapters into nearly orthogonal subspaces [7][20]

Continuous Learning and Safety
- LoRI provides a lightweight continual learning method that maintains safety while adapting to new tasks, addressing the challenge of catastrophic forgetting [8][22]
- In a two-phase training process for safety adapters, LoRI-S outperforms other methods at retaining safety alignment, even under aggressive sparsity [22][23]

Performance Evaluation
- Extensive experiments on various benchmarks show that LoRI matches or exceeds the performance of full fine-tuning and other PEFT methods while using 95% fewer trainable parameters [9][19]
- On single tasks, LoRI variants deliver competitive results across natural language understanding, mathematics, programming, and safety tasks [19][20]

Conclusion
- Overall, LoRI offers an effective, lightweight approach to building safe adapters that support downstream task adaptation while maintaining alignment [23]
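A minimal sketch of the LoRI parametrization as described above: A is a fixed random projection (never updated), and only the entries of B selected by a task-specific sparse mask receive gradient updates. The class name is illustrative, and the random mask below is a placeholder for the paper's task-specific mask selection.

```python
import torch
import torch.nn as nn

class LoRILinear(nn.Module):
    """Sketch: adapter with frozen random A and a sparse-masked trainable B."""

    def __init__(self, base: nn.Linear, rank: int = 8, sparsity: float = 0.9):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False         # base weights stay frozen

        d_out, d_in = base.weight.shape
        # A is a fixed random projection: stored as a buffer, so it
        # receives no gradient updates during fine-tuning.
        self.register_buffer("A", torch.randn(rank, d_in) / d_in**0.5)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        # Binary mask over B; random here for illustration (the paper
        # derives a task-specific mask). sparsity=0.9 trains ~10% of B.
        self.register_buffer("mask", (torch.rand(d_out, rank) > sparsity).float())
        # Zero gradients on masked-out entries of B after each backward.
        self.B.register_hook(lambda g: g * self.mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ (self.B * self.mask).T
```

Because A is frozen and shared randomness can be re-seeded per task, only the masked entries of B need to be stored per adapter, which is where the claimed 95% parameter reduction comes from.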