HAMi
36Kr Exclusive | Open-source heterogeneous compute scheduling platform 密瓜智能 (Dynamia.ai) secures tens of millions of yuan from Fosun Capital, offering enterprises efficient and flexible computing solutions
36Kr · 2026-01-06 04:33
Core Insights
- The emergence of large models has made GPU computing power "more valuable than gold," yet much of it is wasted: global utilization rates sit at only 10%-20% [1]
- Dynamia.ai has completed an angel round of financing to build a heterogeneous compute virtualization and scheduling management platform, with the funds primarily allocated to the HAMi open-source ecosystem and its industrialization [1]

Heterogeneous Computing Fragmentation
- The rise of domestic compute and diverse AI chips has made enterprises' internal computing environments more complex, creating challenges in management and scheduling [2]
- The key issues are the inability to schedule heterogeneous compute resources under a unified system, insufficient resource-sharing efficiency, and low utilization rates, all critical problems in AI infrastructure [2]

Computing Resource Allocation
- Dynamia.ai's HAMi platform enables deep virtualization and pooled management, decoupling compute resources from the physical hardware [4]
- The technology allows fine-grained slicing and memory over-provisioning, significantly increasing task density on a single GPU [4]

Application Cases
- In a case with SF Technology, the platform allowed 19 testing services to be deployed on just 6 GPUs, saving 13 GPUs and doubling resource efficiency [5][6]
- In the PREP EDU case in Vietnam, the platform optimized GPU infrastructure by 90% and reduced cluster pain points by 50% through effective scheduling [5][6]

Cross-Vendor Adaptation
- The platform has been adapted to more than 9 chip types, including NVIDIA and Huawei Ascend, and supports dynamic Multi-Instance GPU (MIG) configuration for standardized management [8]
- It features automatic elastic scaling and priority mechanisms to ensure core business tasks receive resources first during periods of high demand [8]

Commercialization and Services
- Dynamia.ai is developing commercial products and technical services on top of HAMi to provide robust engineering capabilities and operational support for enterprises [10]
- The company secured 2 million yuan in product orders within its first quarter of operation [5]

Founding Team and Vision
- The founding team has extensive experience in cloud computing and AI infrastructure; key members are contributors to Kubernetes and other CNCF projects [11]
- The vision is to make heterogeneous computing as accessible as utilities, building a globally leading compute-scheduling ecosystem that improves efficiency across the AI industry [12]

Investor Perspectives
- Investors see heterogeneous computing as a long-term trend in the compute market, emphasizing efficient resource utilization and cost savings [12]
- The open-source HAMi project is viewed as a scalable solution aligned with collaborative trends in the AI industry, aiming to improve return on investment for global clients [12][13]
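The "fine-grained slicing and memory over-provisioning" idea above can be sketched in a few lines. This is an illustrative model, not HAMi's actual code: the GPU names, sizes, and the 1.5x overcommit factor are assumptions chosen for the example. It shows how admitting tasks against a memory budget larger than physical memory raises task density per GPU.

```python
# Illustrative sketch (not HAMi's implementation): pack tasks onto GPUs by
# memory share, with an over-provisioning factor that lets the admitted
# total exceed physical device memory.
from dataclasses import dataclass, field

@dataclass
class GPU:
    name: str
    mem_mb: int                       # physical device memory
    tasks: list = field(default_factory=list)

    def admitted_mb(self) -> int:
        # Total memory promised to tasks already placed on this GPU.
        return sum(mem for _, mem in self.tasks)

def first_fit(gpus, task, mem_mb, overcommit=1.5):
    """Place a task on the first GPU whose admitted memory, including this
    task, stays within physical memory times the overcommit factor."""
    for gpu in gpus:
        if gpu.admitted_mb() + mem_mb <= gpu.mem_mb * overcommit:
            gpu.tasks.append((task, mem_mb))
            return gpu.name
    return None  # no GPU can admit the task

gpus = [GPU("gpu-0", 24_000), GPU("gpu-1", 24_000)]
# Six 8 GB test services: at 1.5x overcommit each 24 GB GPU admits up to
# 36 GB, so four services land on gpu-0 and two on gpu-1.
placements = [first_fit(gpus, f"svc-{i}", 8_000) for i in range(6)]
```

With `overcommit=1.0` the same workload would spread three services per GPU; the factor is what buys the extra density, at the cost of relying on tasks not all peaking at once.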
HAMi × NVIDIA: A Detailed Look at the GPU Topology-Aware Scheduling Implementation
AI前线 · 2025-10-25 05:32
Core Insights
- HAMi is an active open-source project maintained by over 350 contributors from more than 15 countries and adopted by over 200 enterprises and institutions, demonstrating its scalability and support capabilities [2]
- Topology-aware scheduling for NVIDIA GPUs, introduced in version v2.7.0, addresses communication bottlenecks in high-performance computing (HPC) and AI model training by optimizing task placement to raise overall computational efficiency [2][3]

Feature Overview
- The core design quantifies the physical topology into "communication scores" between devices, letting the scheduler make optimal placement decisions based on those scores [5]
- Topology scores are computed dynamically: the Device Plugin uses NVML to detect the physical links between GPUs, providing the basis for scheduling decisions [6]
- Scheduling proceeds in two phases: topology registration, which quantifies physical connections into scores the scheduler can understand, and scheduling decision-making, which selects the optimal devices based on those scores [9][10]

Implementation Details
- Discovering and quantifying topology information is the foundation for the subsequent intelligent decisions; the result is a score table reported to the scheduler [13]
- The Fit function implements a dual-strategy optimization algorithm, applying a "best match" strategy to multi-GPU tasks and a "minimal disruption" strategy to single-GPU tasks, which preserves the long-term health of the cluster's topology resources [6][22]

Usage
- Users enable topology-aware scheduling with a single annotation; the scheduler automatically applies the appropriate strategy based on the number of GPUs requested [25][26]
- The design philosophy favors dynamic discovery over static configuration and foresighted decision-making over short-sighted allocation, providing a robust GPU scheduling solution for large-scale AI training and HPC tasks in cloud-native environments [27]
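The score table and the dual-strategy Fit described above can be sketched as follows. This is an illustrative model, not HAMi's actual Fit code: the link scores (NVLink pairs at 100, PCIe at 40), the GPU indices, and the `fit` function are assumptions made for the example. It shows the two strategies side by side: multi-GPU requests take the best-connected subset, while single-GPU requests take the least-connected device so that tightly coupled groups stay intact for future jobs.

```python
# Illustrative sketch (not HAMi's Fit implementation): a symmetric score
# table quantifies GPU-to-GPU links, and a dual-strategy fit chooses
# "best match" for multi-GPU requests, "minimal disruption" for single-GPU.
from itertools import combinations

# Assumed communication scores: GPUs 0-1 and 2-3 are NVLink pairs (100);
# all other paths cross PCIe (40).
SCORES = {
    (0, 1): 100, (2, 3): 100,
    (0, 2): 40, (0, 3): 40, (1, 2): 40, (1, 3): 40,
}

def score(a, b):
    # Look up the symmetric pairwise score; unknown pairs score 0.
    return SCORES.get((min(a, b), max(a, b)), 0)

def fit(free, want):
    if want == 1:
        # Minimal disruption: take the GPU least connected to the others,
        # leaving high-bandwidth groups whole for future multi-GPU jobs.
        return [min(free, key=lambda g: sum(score(g, o) for o in free if o != g))]
    # Best match: take the subset with the highest total pairwise score.
    return list(max(combinations(free, want),
                    key=lambda s: sum(score(a, b) for a, b in combinations(s, 2))))

pair = fit([0, 1, 2, 3], 2)    # picks an NVLink pair: [0, 1]
single = fit([0, 1, 2], 1)     # picks 2, preserving the 0-1 NVLink pair
```

The registration phase in the article corresponds to building `SCORES` (in HAMi's case, from NVML link detection); the decision phase corresponds to `fit` consuming it.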