1.58-bit matches FP16! Microsoft releases a new model distillation framework, with an all-Chinese author team
量子位 (QbitAI) · 2025-10-20 03:46
Core Insights
- Microsoft has introduced a new distillation framework, BitNet Distillation (BitDistill), which quantizes models to 1.58-bit weights with minimal performance loss while cutting memory consumption to roughly 1/10 of FP16 [1][6][22].

Group 1: Framework Overview
- BitDistill has been validated on models with 4 billion parameters and below, such as Qwen and Gemma, and should in principle apply to other Transformer models [2].
- The framework consists of three interconnected stages: model structure refinement, continued pre-training, and distillation-based fine-tuning [8]; sketches of the key pieces follow this summary.

Group 2: Model Structure Optimization
- The primary goal of the structure optimization is to support the training of 1.58-bit models and to address the optimization instability that is common in low-precision training [9] (see the SubLN and ternary-quantization sketches below).
- BitDistill inserts a normalization module called SubLN into each Transformer layer, which stabilizes training by controlling the variance of activations [10][12].

Group 3: Continued Pre-training
- A lightweight continued pre-training phase helps the model gradually adapt its weights from full precision to a distribution suited to the 1.58-bit representation [14][15].
- This phase lets the model "learn how to be quantized," preventing information loss during the fine-tuning stage [16].

Group 4: Distillation-based Fine-tuning
- BitDistill uses a dual distillation mechanism, logits distillation plus multi-head attention distillation, to recover the performance of the quantized model [18] (see the loss sketch below).
- Logits distillation uses the probability distribution of the full-precision model as "soft labels" to guide the quantized model [19].

Group 5: Performance Evaluation
- Across downstream tasks, BitDistill performs nearly on par with full-precision models while using far less memory and running inference faster [22].
- In text classification, the 1.58-bit model matched the accuracy of full-precision fine-tuned models and outperformed directly quantized models [23][24].
- In text summarization, the quality of the generated text was nearly identical to that of the full-precision models, with slight gains in BLEU scores [25][27].

Group 6: Generalizability and Compatibility
- BitDistill has also been applied to other pre-trained models such as Gemma and Qwen2.5, recovering their performance with high fidelity [28].
- The framework is compatible with various quantization strategies, demonstrating its value as a standalone distillation solution for a range of post-quantization optimization scenarios [28].
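The SubLN placement described in Group 2 can be illustrated with a short PyTorch sketch. This is not the official BitDistill code; the class and parameter names are assumptions, and the only point being shown is the position of the extra LayerNorm: immediately before the output projection of the attention sub-layer and before the down-projection of the FFN, where it keeps activation variance under control.

```python
# Minimal sketch of a Transformer block with SubLN inserted before the output
# projections, assuming a standard pre-LN architecture (illustrative names only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubLNTransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        self.attn_ln = nn.LayerNorm(d_model)       # pre-attention LayerNorm
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.attn_subln = nn.LayerNorm(d_model)    # SubLN: right before the attention output projection
        self.attn_out = nn.Linear(d_model, d_model)

        self.ffn_ln = nn.LayerNorm(d_model)        # pre-FFN LayerNorm
        self.ffn_in = nn.Linear(d_model, d_ff)
        self.ffn_subln = nn.LayerNorm(d_ff)        # SubLN: right before the FFN down-projection
        self.ffn_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape

        # Attention sub-layer: SubLN normalizes the mixed heads before projecting out.
        h = self.attn_ln(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        x = x + self.attn_out(self.attn_subln(attn))

        # FFN sub-layer: SubLN normalizes the hidden activation before the down-projection.
        h = torch.relu(self.ffn_in(self.ffn_ln(x)))
        x = x + self.ffn_out(self.ffn_subln(h))
        return x


if __name__ == "__main__":
    block = SubLNTransformerBlock()
    out = block(torch.randn(2, 16, 512))
    print(out.shape)  # torch.Size([2, 16, 512])
```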
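The 1.58-bit format itself can be sketched as ternary weight quantization with an "absmean" scale, in the style of BitNet b1.58: each weight becomes one of {-1, 0, +1}, and log2(3) ≈ 1.58 bits per weight versus 16 bits for FP16 is roughly where the ~10x memory reduction comes from. Whether BitDistill applies the scale per tensor or per group is an assumption here; the straight-through estimator shows how such weights can still receive gradients during continued pre-training and fine-tuning.

```python
# Hedged sketch of BitNet-b1.58-style ternary weight quantization (not BitDistill's
# exact code): per-tensor absmean scaling, values clamped to {-1, 0, +1}.
import torch


def quantize_weights_ternary(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)        # absmean scaling factor
    w_q = (w / scale).round().clamp(-1, 1)       # ternary values in {-1, 0, +1}
    return w_q, scale


def fake_quantize(w: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: the forward pass sees quantized weights,
    # the backward pass lets gradients flow through as if unquantized.
    w_q, scale = quantize_weights_ternary(w)
    return w + (w_q * scale - w).detach()


if __name__ == "__main__":
    w = torch.randn(256, 256)
    w_q, _ = quantize_weights_ternary(w)
    print(w_q.unique())                          # tensor([-1., 0., 1.])
```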
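The dual distillation objective from Group 4 combines the ordinary task loss with a soft-label KL term on logits and an attention-matching term. The sketch below is a simplified stand-in: the exact attention-distillation formulation BitDistill uses may differ (for example, relation-based matching rather than a plain MSE on attention maps), and the weights alpha, beta and the temperature are illustrative.

```python
# Hedged sketch of a combined fine-tuning loss: task CE + logits distillation (soft
# labels from the FP16 teacher) + a simple attention-map matching term.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits,
                      student_attn, teacher_attn,
                      labels, temperature: float = 2.0,
                      alpha: float = 1.0, beta: float = 1.0):
    # Standard task loss on hard labels.
    task = F.cross_entropy(student_logits, labels)

    # Logits distillation: the teacher's softened distribution acts as soft labels.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    logits_kd = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Attention distillation (simplified): align student and teacher attention maps.
    attn_kd = F.mse_loss(student_attn, teacher_attn)

    return task + alpha * logits_kd + beta * attn_kd


if __name__ == "__main__":
    B, C, H, T = 4, 10, 8, 16
    loss = distillation_loss(
        torch.randn(B, C), torch.randn(B, C),
        torch.softmax(torch.randn(B, H, T, T), dim=-1),
        torch.softmax(torch.randn(B, H, T, T), dim=-1),
        torch.randint(0, C, (B,)),
    )
    print(loss.item())
```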