Memory usage cut by up to 75%: US Department of Energy scientists propose D-CHAG, a cross-channel hierarchical aggregation method that lets extremely large models run on multi-channel datasets
36Ke · 2026-02-11 09:17
Core Insights
- The article covers a distributed cross-channel hierarchical aggregation method (D-CHAG), introduced by scientists at Oak Ridge National Laboratory, aimed at enabling large-scale models to process multi-channel datasets efficiently [1][2][4]

Group 1: Methodology and Performance
- D-CHAG distributes the tokenization process across devices and applies a hierarchical strategy for channel aggregation, allowing large-scale models to run efficiently on multi-channel datasets [2][4]
- In evaluations on hyperspectral imaging and weather prediction tasks on the Frontier supercomputer, D-CHAG reduced memory usage by up to 75% when combined with tensor parallelism and model sharding, and achieved more than 2x throughput improvement on up to 1,024 AMD GPUs [2][4]
- The method addresses memory bottlenecks and computational-efficiency issues in multi-channel foundation model training, achieving up to 70% memory reduction compared to tensor parallelism alone [4]

Group 2: Data Utilization
- The research used two representative multi-channel datasets: hyperspectral images of poplar trees (494 images with 500 spectral channels) and the ERA5 high-resolution reanalysis dataset for weather prediction, with 80 input channels derived from various atmospheric and surface variables [5][6]
- The hyperspectral dataset is important for biomass research and plant phenotyping, while the weather prediction dataset was adapted for model training through regridding [5][6]

Group 3: Technical Advantages
- D-CHAG combines distributed tokenization with hierarchical channel aggregation, reducing the memory used by each cross-channel attention layer because each layer processes fewer channels [9][11]
- The method enables efficient training of larger models on high-channel datasets, supporting configurations that were previously infeasible with standard tensor parallelism [25]

Group 4: Comparative Analysis
- Compared with baseline methods, D-CHAG showed matching training loss on hyperspectral imaging and significant improvements on weather prediction tasks across multiple metrics [20][21]
- For models with 1.7 billion parameters, D-CHAG configurations delivered performance gains of roughly 60% on 1,024-channel data while maintaining efficiency on 512-channel data [15][25]
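The article does not include D-CHAG's code, but the core idea it describes — lowering per-layer memory by having each cross-channel attention step see only a small group of channels, then aggregating level by level — can be sketched in a minimal single-process form. Everything below (the function names, the group size, mean pooling between levels) is an illustrative assumption, not ORNL's actual implementation:

```python
import numpy as np

def channel_attention(x):
    """Cross-channel self-attention over x of shape (C, d).
    The score matrix is C x C, so memory grows quadratically in C."""
    scores = x @ x.T / np.sqrt(x.shape[1])                 # (C, C)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x                                     # (C, d)

def hierarchical_aggregate(x, group_size):
    """Aggregate C channel embeddings down to one, level by level.
    Each attention call sees at most `group_size` channels, so the
    largest score matrix is group_size x group_size instead of C x C."""
    while x.shape[0] > 1:
        groups = [x[i:i + group_size] for i in range(0, x.shape[0], group_size)]
        # Attend within each small group, then pool it to one embedding
        x = np.stack([channel_attention(g).mean(axis=0) for g in groups])
    return x[0]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((512, 64))    # 512 channels, 64-dim embeddings
pooled = hierarchical_aggregate(tokens, group_size=8)
print(pooled.shape)                        # (64,)
```

With 512 channels and groups of 8, no single attention call ever materializes more than an 8 x 8 score matrix, versus 512 x 512 for flat cross-channel attention — the same quadratic-to-roughly-linear memory trade the article attributes to the hierarchical strategy.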
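The other half of the method, distributed tokenization, assigns each device only a slice of the input channels to tokenize. The article gives no details of the partitioning scheme, so the contiguous block layout below is purely a plausible assumption for illustration:

```python
def shard_channels(num_channels, world_size, rank):
    """Hypothetical contiguous block partition of channel indices across
    ranks. Each rank tokenizes only its own slice, so per-device
    activation memory scales with num_channels / world_size."""
    per_rank = -(-num_channels // world_size)   # ceiling division
    start = rank * per_rank
    return list(range(start, min(start + per_rank, num_channels)))

# Example: the 500 hyperspectral channels from the poplar dataset
# spread over 8 devices; every channel lands on exactly one rank.
shards = [shard_channels(500, 8, r) for r in range(8)]
print(len(shards[0]), len(shards[-1]))          # 63 59
```

In a real multi-GPU run the aggregation across shards would then happen through collective communication (e.g. all-gathers between hierarchy levels), which this single-process sketch omits.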