Microsoft's BitDistill Compresses LLMs to 1.58 Bits: 10x Memory Savings, 2.65x CPU Inference Speedup
机器之心 · 2025-10-20 07:48
Core Insights
- The article discusses the difficulty of deploying large language models (LLMs) efficiently in downstream applications, particularly on resource-constrained devices such as smartphones, due to high memory and computational costs [1][7]
- A new approach called BitDistill is introduced, which compresses existing pre-trained LLMs into a 1.58-bit BitNet model while minimizing performance loss and training cost [4][19]

Group 1: Challenges and Solutions
- LLMs face significant deployment challenges as their scale grows: naively quantizing them to low-bit representations leads to unstable training and degraded performance [2][10]
- Extreme low-bit LLMs such as BitNet reduce memory usage and accelerate inference, but matching the accuracy of high-precision models normally requires extensive pre-training from scratch; BitDistill instead starts from an existing full-precision model (a minimal ternary-quantization sketch appears after this summary) [1][4]

Group 2: BitDistill Framework
- BitDistill consists of three stages: model refinement, continued pre-training, and distillation-based fine-tuning [8][12]
- The first stage addresses activation-variance issues in low-bit models by inserting additional normalization layers that stabilize optimization (see the normalization sketch below) [9][30]
- The second stage continues training on a small amount of pre-training data so the model adapts to the 1.58-bit representation before being fine-tuned on specific tasks [11][32]
- The third stage uses knowledge distillation to align the quantized student's behavior with that of the full-precision teacher model (see the distillation-loss sketch below) [13][27]

Group 3: Experimental Results
- BitDistill scales well, matching full-precision baselines while delivering roughly 2x faster inference and close to 10x lower memory usage [19][20]
- On text classification and summarization tasks, the 1.58-bit BitDistill models maintain high accuracy and output quality across a range of model sizes [16][21]
- The method generalizes across architectures, remaining stable when applied to different pre-trained backbone models [22]

Group 4: Ablation Studies
- Ablation studies show that every stage of the BitDistill pipeline matters: removing any single stage leads to a significant performance drop [25][26]
- Combining logits distillation with attention distillation yields the best results, underscoring the value of using multiple strategies to counter quantization-induced degradation [27][29]
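
To make the "1.58-bit" figure concrete, the sketch below shows absmean ternary quantization in the style of BitNet b1.58, where each weight is mapped to {-1, 0, +1} with a single scale; the function name and the per-tensor scale granularity are illustrative assumptions rather than the exact BitDistill recipe. A comment also spells out the rough arithmetic behind the ~10x memory claim.

```python
import torch

def absmean_ternary_quant(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to ternary values {-1, 0, +1} with a
    per-tensor scale, in the style of BitNet b1.58's absmean scheme."""
    scale = w.abs().mean().clamp(min=eps)         # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)  # ternary codes in {-1, 0, +1}
    return w_ternary, scale                       # dequantize as w_ternary * scale

# Rough memory arithmetic behind the ~10x claim: a ternary weight needs
# log2(3) ≈ 1.58 bits versus 16 bits for FP16/BF16, i.e. roughly 10x smaller.
w = torch.randn(4096, 4096)
w_q, s = absmean_ternary_quant(w)
print(w_q.unique())   # tensor([-1., 0., 1.])
```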
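The stage-1 refinement is described only as adding normalization layers to tame activation variance; below is a minimal sketch of that idea, assuming the extra module is an RMSNorm placed in front of the output projections of the attention and FFN blocks (SubLN-style). The attribute names (`model.model.layers`, `o_proj`, `down_proj`) follow LLaMA-style Hugging Face models and are assumptions, as is the use of `torch.nn.RMSNorm` (PyTorch ≥ 2.4).

```python
import torch
import torch.nn as nn

class SubLNOutputProj(nn.Module):
    """Wrap an existing output projection (e.g. attention o_proj or FFN
    down_proj) with an extra RMSNorm, so activations are re-normalized
    before entering the low-bit linear layer. Placement is an assumption."""
    def __init__(self, proj: nn.Linear):
        super().__init__()
        self.norm = nn.RMSNorm(proj.in_features)  # extra normalization layer
        self.proj = proj                          # original (to-be-quantized) linear

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.norm(x))

def refine_model(model: nn.Module) -> nn.Module:
    """Insert the extra norms in-place; the module paths below are
    LLaMA-style conventions, not guaranteed for every architecture."""
    for layer in model.model.layers:
        layer.self_attn.o_proj = SubLNOutputProj(layer.self_attn.o_proj)
        layer.mlp.down_proj = SubLNOutputProj(layer.mlp.down_proj)
    return model
```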
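The article notes that logits distillation and attention distillation work best in combination; the following is a simplified sketch of such a combined objective, not the paper's exact loss. The temperature `T`, the weight `alpha`, and the choice to match attention probability maps directly are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits,
                 student_attn, teacher_attn,
                 T: float = 2.0, alpha: float = 1.0):
    """Simplified distillation objective combining logits KD with
    attention-map KD. Hyperparameters are illustrative, not from the paper."""
    # Logits distillation: KL between temperature-softened distributions.
    kd_logits = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Attention distillation: pull the quantized student's attention
    # distributions toward the full-precision teacher's.
    kd_attn = F.kl_div(
        torch.log(student_attn.clamp_min(1e-9)),
        teacher_attn,
        reduction="batchmean",
    )
    return kd_logits + alpha * kd_attn
```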