Model Pruning

ICML 2025 | CoTo: Helping LoRA Training Hit Its Stride, with Mastery of Both Model Fusion and Pruning
机器之心· 2025-07-26 12:17
Core Viewpoint
- The article introduces CoTo, a progressive training strategy designed to enhance the robustness and effectiveness of Low-Rank Adaptation (LoRA) models, addressing issues such as training instability and the performance drop after pruning [1][4][23]

Summary by Sections

Conventional LoRA Training Issues
- LoRA faces "lazy training," in which optimization gets stuck near suboptimal solutions, limiting generalization [7]
- Training is hierarchically imbalanced: gradient updates concentrate on the top layers, leaving lower layers undertrained [7]
- These issues complicate downstream operations like model fusion and pruning, often resulting in unsatisfactory outcomes [7]

CoTo Strategy
- CoTo employs a simple yet effective progressive activation strategy, initially deactivating a portion of LoRA adapters to encourage uniform gradient flow across all layers [5][8]
- The activation probability of the adapters is gradually increased during training, returning to standard fine-tuning mode in the later stages [8]

Experimental Results
- CoTo significantly improves the fusion and pruning capabilities of LoRA models, enhancing single-task generalization performance and training efficiency [12][23]
- In linear interpolation tasks, CoTo models maintain smooth performance transitions, unlike standard LoRA, which suffers sharp declines [13]
- CoTo outperforms standard LoRA in both structured and unstructured pruning scenarios, demonstrating enhanced fault tolerance [17]

Performance and Efficiency Improvements
- CoTo consistently boosts performance across various benchmarks, including visual and language tasks, and achieves over 24% training acceleration when applied to HiRA [24][23]

Ablation Studies
- Rigorous ablation studies validate the design choices of CoTo and provide insights into effective regularization of LoRA [21]

Conclusion
- CoTo effectively resolves the hierarchical imbalance and lazy-optimization issues in LoRA training, enhancing model robustness and simplifying downstream operations like fusion and pruning [23]
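The progressive activation schedule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear ramp shape and the `warmup_frac` hyperparameter are assumptions, and per-layer deactivation is modeled as an independent Bernoulli draw per adapter.

```python
import random

def coto_activation_prob(step, total_steps, warmup_frac=0.75):
    """Activation probability for LoRA adapters at a given training step.

    Ramps linearly from 0 to 1 over the first `warmup_frac` of training,
    then stays at 1, i.e. training reverts to standard LoRA fine-tuning
    in the later stages. The linear shape and 0.75 cutoff are assumed.
    """
    progress = step / (warmup_frac * total_steps)
    return min(1.0, progress)

def sample_active_adapters(num_layers, p, rng=random):
    """Independently keep each layer's LoRA adapter active with probability p.

    Deactivated adapters would contribute nothing to the forward pass,
    which redistributes gradient flow toward the layers that stay active.
    """
    return [rng.random() < p for _ in range(num_layers)]
```

Early in training most adapters are dropped, forcing lower layers to receive meaningful updates; by the final quarter of training all adapters are active and optimization proceeds as ordinary LoRA fine-tuning.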
High-Performance Models at Low Cost: Paradox or Possibility?
机器之心· 2025-05-31 17:15
Core Viewpoint
- The article discusses the paradox of achieving high performance in AI models at low cost, questioning whether the perceived decline in model performance is intentional on the part of AI companies and exploring the implications of cost-saving measures for model quality [2][3]

Group 1: Low-Cost High-Performance Models
- The performance-versus-cost dilemma of large language models (LLMs) has been a focal point of public and industry concern, with ongoing debate over whether top model companies sacrifice precision or service stability to save on inference costs [2][3]
- Following the popularity of ChatGPT, users have expressed dissatisfaction with perceived declines in performance, citing weakened logic, increased errors, and difficulty following instructions [2][3]
- The public's concern that companies sacrifice model performance for cost savings is supported by technical and market evidence, particularly in the controversy surrounding the DeepSeek-R1 model [3][4]
- The true "full version" of DeepSeek-R1 requires significant hardware investment, with initial costs reaching hundreds of thousands of yuan, leading some platforms to potentially serve distilled versions that compromise inference capability and stability [3][4]

Group 2: Cost Management Strategies
- To balance costs and performance, high-end "full version" models are not widely available, especially in a market flooded with free or low-cost services that often lack sufficient performance [6]
- AI companies are increasingly adopting model distillation or simplified models to reduce inference costs and manage financial investment [6]
- Common strategies for addressing cost pressure include lowering model precision through techniques such as model quantization, pruning, and knowledge distillation, which have become standard practice in the industry [6]
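To make one of these cost-reduction techniques concrete, a minimal sketch of unstructured magnitude pruning (zeroing out the smallest-magnitude weights at a target sparsity) might look like the following. The function name and NumPy-based formulation are illustrative assumptions, not from the article.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with smallest magnitude.

    Unstructured pruning: individual weights are removed regardless of
    their position, trading a small accuracy loss for a sparser (and
    potentially cheaper to serve) model.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask
```

Quantization and knowledge distillation follow the same economic logic, shrinking the serving footprint of a model, but act on numeric precision and on a smaller student model, respectively, rather than on weight sparsity.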