HAMi × NVIDIA: A Deep Dive into the GPU Topology-Aware Scheduling Implementation
AI前线 · 2025-10-25 05:32
Core Insights
- HAMi is an active open-source project maintained by over 350 contributors from more than 15 countries and adopted by over 200 enterprises and institutions, demonstrating its scalability and breadth of support [2]
- Version v2.7.0 introduces topology-aware scheduling for NVIDIA GPUs, addressing communication bottlenecks in high-performance computing (HPC) and AI model training scenarios by optimizing task placement to enhance overall computational efficiency [2][3]

Feature Overview
- The core design quantifies the physical topology into pairwise "communication scores" between devices, allowing the scheduler to make optimal placement decisions based on these scores [5]
- Topology scores are calculated dynamically: the Device Plugin uses NVML to detect the physical connections between GPUs, providing the basis for scheduling decisions [6]
- Scheduling proceeds in two phases: topology registration, which quantifies physical connections into scores the scheduler understands, and scheduling decision-making, which selects the optimal devices based on those scores [9][10]

Implementation Details
- Discovering and quantifying topology information is the foundation for the subsequent intelligent decision-making; the result is a score table that is reported to the scheduler [13]
- The Fit function implements a dual-strategy optimization algorithm: it automatically applies a "best match" strategy for multi-GPU tasks and a "minimal disruption" strategy for single-GPU tasks, preserving the long-term health of the cluster's topology resources [6][22]

Usage
- Users can enable topology-aware scheduling with a single annotation; the scheduler then automatically applies the appropriate strategy based on the number of GPUs requested [25][26]
- The design philosophy emphasizes dynamic discovery over static configuration and foresighted decision-making over short-sighted allocation, providing a robust GPU scheduling solution for large-scale AI training and HPC tasks in cloud-native environments [27]
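To make the topology-registration phase concrete, the sketch below shows how pairwise GPU link types (as NVML reports them, e.g. via `nvidia-smi topo -m`) could be quantified into a symmetric score table. This is an illustrative assumption, not HAMi's actual code: the link-type names follow NVML's conventions, but the numeric scores are invented for the example.

```python
# Hypothetical sketch of a topology-registration step: turn detected
# GPU-pair link types into "communication scores". The score values
# are illustrative assumptions, not HAMi's real constants.

# Assumed scores per link type (higher = faster communication path)
LINK_SCORES = {
    "NVLINK": 100,  # direct NVLink connection
    "PIX": 60,      # same PCIe switch
    "NODE": 40,     # same NUMA node
    "SYS": 20,      # path crosses the inter-socket interconnect
}

def build_score_table(links: dict[tuple[int, int], str]) -> dict[tuple[int, int], int]:
    """Turn a map of GPU-pair -> link type into a symmetric score table."""
    table = {}
    for (a, b), link in links.items():
        score = LINK_SCORES[link]
        table[(a, b)] = score
        table[(b, a)] = score  # communication scores are symmetric
    return table

# Example: 4 GPUs where 0-1 and 2-3 are NVLink pairs on different sockets.
links = {
    (0, 1): "NVLINK",
    (2, 3): "NVLINK",
    (0, 2): "SYS", (0, 3): "SYS",
    (1, 2): "SYS", (1, 3): "SYS",
}
scores = build_score_table(links)
print(scores[(1, 0)])  # 100
```

In the real Device Plugin this table would be built from live NVML queries rather than a hard-coded map, and then reported to the scheduler as part of device registration.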
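The dual-strategy Fit logic described above can be sketched as follows. This is a simplified reconstruction under stated assumptions, not HAMi's actual `Fit` implementation: it assumes a precomputed symmetric score table and uses brute-force subset enumeration, whereas the real scheduler works against its own device and score representations.

```python
from itertools import combinations

# Illustrative score table for 4 GPUs: (0,1) and (2,3) are NVLink pairs
# (score 100); every other pair crosses the socket interconnect (score 20).
SCORES = {}
for a, b in combinations(range(4), 2):
    s = 100 if (a, b) in {(0, 1), (2, 3)} else 20
    SCORES[(a, b)] = SCORES[(b, a)] = s

def group_score(gpus):
    """Total pairwise communication score inside a candidate GPU group."""
    return sum(SCORES[p] for p in combinations(gpus, 2))

def fit(free, requested):
    """Pick GPUs for a task using the two strategies described in the text."""
    if requested > 1:
        # "Best match": choose the most tightly connected subset.
        return max(combinations(free, requested), key=group_score)
    # "Minimal disruption": for a single-GPU task, choose the GPU with the
    # weakest links to the other free GPUs, so well-connected groups
    # (e.g. intact NVLink pairs) stay available for future multi-GPU tasks.
    return (min(free, key=lambda g: sum(SCORES[(g, o)] for o in free if o != g)),)

print(fit([0, 1, 2, 3], 2))  # (0, 1)
print(fit([0, 1, 2], 1))     # (2,)
```

Note how the single-GPU case with GPUs 0, 1, 2 free picks GPU 2: its NVLink peer (GPU 3) is busy, so using it breaks no intact pair, which is exactly the "minimal disruption" idea.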