The StreamBP Algorithm

Losslessly cuts activation memory by 80% and extends the trainable sequence length by up to 5x, with only two lines of code needed for integration
机器之心 · 2025-06-23 07:44
Core Insights
- The article discusses the StreamBP algorithm, which significantly reduces the memory required for training large language models (LLMs) by restructuring the backpropagation process [3][6][15].

Group 1: StreamBP Algorithm
- StreamBP reduces the memory consumption of activation values to about 20% of that required by gradient checkpointing, allowing for longer sequence lengths during training [3][6].
- Under the same memory constraints, StreamBP achieves a maximum sequence length 2.8 to 5.5 times greater than that of gradient checkpointing [6][22].
- The algorithm applies to common LLM objective functions such as SFT, GRPO, PPO, and DPO, and its code is open-sourced for integration into existing training frameworks [6][12].

Group 2: Memory and Performance Comparison
- In terms of memory usage, StreamBP needs only 5% to 15% of the total activation memory across all layers, whereas the complete activations of a single layer account for over 85% of that memory [13][19].
- A comparison of memory and time costs between standard backpropagation and StreamBP shows that StreamBP sharply reduces peak memory usage while keeping the computational cost roughly the same [14][25].

Group 3: Application in LLM Training
- StreamBP is specifically designed to optimize memory usage in the Transformer layers and the lm_head layer of LLMs, effectively lowering the memory consumption of layer activations and logits (a code sketch of the chunking idea follows below) [16][20].
- By enabling longer sequence lengths, the algorithm also allows for larger batch sizes and faster training, which is crucial for training efficiency [25][28].
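The logits portion is the easiest piece to illustrate. Below is a minimal PyTorch sketch of the chunked-backward idea applied to the lm_head and loss computation; it is an illustration under assumptions, not the authors' released StreamBP implementation, and the function name `chunked_lm_head_loss_backward` and the `chunk_size` parameter are hypothetical. Instead of materializing the full [batch, seq_len, vocab] logits tensor, the projection and cross-entropy loss are computed chunk by chunk along the sequence dimension, and `backward()` is called per chunk so that only one chunk of logits is alive at any time.

```python
import torch
import torch.nn.functional as F

def chunked_lm_head_loss_backward(hidden, lm_head, labels, chunk_size=1024):
    """Chunked loss + backward over the sequence dimension (illustrative sketch).

    hidden:  [batch, seq_len, d_model], detached from the trunk, requires_grad_(True)
    lm_head: nn.Linear(d_model, vocab_size)
    labels:  [batch, seq_len] token ids, -100 for ignored positions (no label shifting here)
    """
    _, seq_len, _ = hidden.shape
    num_tokens = (labels != -100).sum().clamp(min=1)
    total_loss = 0.0

    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        # Only this chunk's logits ([batch, chunk, vocab]) are materialized.
        logits = lm_head(hidden[:, start:end, :])
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels[:, start:end].reshape(-1),
            ignore_index=-100,
            reduction="sum",
        ) / num_tokens
        # Backward per chunk: gradients accumulate into hidden.grad and lm_head's
        # parameters, and this chunk's logits and graph are freed before the next step.
        loss.backward()
        total_loss += loss.item()
    return total_loss
```

After the loop, `hidden.grad` holds the gradient of the loss with respect to the detached hidden states, which the caller can propagate through the transformer trunk with `trunk_output.backward(hidden.grad)`. Note that the sketch covers only the logits/lm_head part; per the article, StreamBP also reduces the activation memory inside the Transformer layers themselves, which the sketch does not attempt to reproduce.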