Heterogeneous Compute Virtualization
Huawei Releases Open-Source AI Container Technology Flex:ai: Putting Idle Computing Power to Work by Splitting One Card Across Multiple Tasks | Frontline
36Kr · 2025-11-25 13:54
Core Viewpoint
- The AI industry faces the paradox of "insufficient computing power" coexisting with "wasted computing power"; Huawei's release of the AI container technology Flex:ai aims to improve computing resource utilization through three technological innovations [1]

Group 1: Flex:ai Overview
- Huawei officially launched Flex:ai at the 2025 AI Container Application Landing and Development Forum, open-sourcing its XPU pooling and scheduling software [1][2]
- Flex:ai is built on Kubernetes and focuses on fine-grained management and intelligent scheduling of GPU, NPU, and other intelligent computing resources, consolidating scattered computing power into a unified "resource pool" [1][2]

Group 2: Core Capabilities of Flex:ai
- The XPU pooling framework, developed in collaboration with Shanghai Jiao Tong University, splits a single GPU or NPU card into multiple virtual computing units at 10% granularity, raising overall computing utilization by 30% in small-model training and inference scenarios [2]
- The cross-node remote virtualization technology, developed with Xiamen University, aggregates idle XPU computing power across different machines into a "shared computing pool," enabling general-purpose servers without local accelerators to access remote GPU/NPU resources for AI computation [2]
- The Hi Scheduler intelligent scheduler, developed with Xi'an Jiaotong University, addresses the challenge of unified scheduling of heterogeneous computing resources by automatically selecting suitable local or remote resources based on task priority and compute requirements, achieving time-sharing reuse and globally optimal scheduling [2]

Group 3: Open Source Initiative
- Huawei's decision to fully open-source Flex:ai aims to give developers across academia and industry access to all core technological capabilities, promoting standards for heterogeneous computing virtualization and AI application platform integration [2]
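The pooling idea above (one physical card carved into virtual units at 10% granularity, shared by several small jobs) can be sketched as follows. This is a minimal illustration under stated assumptions, not Huawei's implementation; all names (`Card`, `allocate`, the slice count) are hypothetical.

```python
# Hypothetical sketch of 10%-granularity card slicing, as the article
# describes for Flex:ai's XPU pooling. Not Huawei's actual code.
from dataclasses import dataclass, field

SLICES_PER_CARD = 10  # 10% granularity -> 10 virtual slices per card

@dataclass
class Card:
    card_id: str
    free_slices: int = SLICES_PER_CARD
    allocations: dict = field(default_factory=dict)  # task -> slices held

    def allocate(self, task: str, fraction: float) -> bool:
        """Reserve `fraction` of the card, rounded up to 10% steps."""
        needed = -(-int(fraction * 100) // 10)  # ceil to nearest 10%
        if needed > self.free_slices:
            return False
        self.free_slices -= needed
        self.allocations[task] = needed
        return True

    def release(self, task: str) -> None:
        self.free_slices += self.allocations.pop(task, 0)

# Several small jobs share one card instead of each idling a whole card:
card = Card("npu-0")
assert card.allocate("tokenizer-svc", 0.2)   # 20% -> 2 slices
assert card.allocate("small-finetune", 0.5)  # 50% -> 5 slices
assert card.allocate("inference-a", 0.3)     # 30% -> 3 slices
assert not card.allocate("one-more", 0.1)    # card is fully allocated
card.release("inference-a")                  # frees 3 slices
```

This toy bookkeeping shows why fractional slicing lifts utilization: workloads that need far less than a full accelerator no longer monopolize one.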
Huawei Open-Sources Breakthrough Technology Flex:ai: AI Computing Efficiency Up 30%, GPUs and NPUs Used Together
机器之心 · 2025-11-22 04:12
Core Viewpoint
- Huawei has launched the AI container technology Flex:ai to address computing resource waste in the AI industry, a problem exacerbated by rapidly growing AI workloads and low utilization rates of global computing resources [1][3][20]

Group 1: Flex:ai Technology Overview
- Flex:ai integrates GPU and NPU resources into a unified system, allowing dynamic allocation and scheduling of computing resources [1][3]
- Built on the Kubernetes platform, the technology aims to match AI workloads to computing resources more precisely, significantly improving utilization rates [3][19]

Group 2: Key Technological Innovations
- The XPU pooling framework, developed in collaboration with Shanghai Jiao Tong University, divides a single GPU or NPU into multiple virtual computing units, improving average utilization by 30% while keeping virtualization performance loss below 5% [9]
- The cross-node virtualization technology, developed with Xiamen University, aggregates idle computing resources from multiple nodes into a shared pool, enabling general-purpose servers to offload AI workloads to remote GPU/NPU resources [12]
- Context-separation technology designed by Xiamen University reduces external fragmentation by 74% and increases high-priority job throughput by 67% [13]

Group 3: Intelligent Scheduling and Resource Management
- The Hi Scheduler, developed with Xi'an Jiaotong University, schedules heterogeneous computing resources across the cluster, maintaining efficient resource utilization even under fluctuating loads [17]
- The growing demand for AI computing resources underscores the need for better resource-management efficiency, with Flex:ai positioned as a competitive alternative to existing technologies such as Run:ai [19]

Group 4: Open Source Initiative
- Flex:ai will be fully open-sourced to the "Magic Engine Community," contributing to the ModelEngine open-source ecosystem alongside other tools [5]
- Flex:ai's open architecture is expected to promote standardization of domestic computing ecosystems and enhance collaboration among global innovators [19][20]
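The scheduling behavior attributed to Hi Scheduler above (place tasks by priority, preferring local accelerators over remotely pooled ones) can be sketched as a simple greedy policy. This is an illustrative assumption, not Huawei's actual algorithm; every name here (`Resource`, `Task`, `schedule`, the preference order) is hypothetical.

```python
# Hedged sketch of priority-aware placement over a heterogeneous pool:
# high-priority tasks are placed first, and local resources are preferred
# over remote ones to avoid cross-node overhead. Illustrative only.
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    kind: str        # "GPU" or "NPU"
    free_units: int  # virtual compute units still available
    remote: bool     # True if reached via cross-node virtualization

@dataclass
class Task:
    name: str
    priority: int    # higher = more urgent
    need_units: int

def schedule(tasks, resources):
    """Greedy placement: sort tasks by descending priority, then pick
    the local resource with the most headroom, falling back to remote."""
    placement = {}
    for task in sorted(tasks, key=lambda t: -t.priority):
        fits = [r for r in resources if r.free_units >= task.need_units]
        if not fits:
            continue  # task stays pending until capacity frees up
        best = min(fits, key=lambda r: (r.remote, -r.free_units))
        best.free_units -= task.need_units
        placement[task.name] = best.name
    return placement

pool = [
    Resource("gpu-local-0", "GPU", free_units=4, remote=False),
    Resource("npu-remote-0", "NPU", free_units=10, remote=True),
]
jobs = [
    Task("batch-eval", priority=1, need_units=6),
    Task("online-infer", priority=9, need_units=3),
]
plan = schedule(jobs, pool)
# The high-priority job lands on the local GPU; the large batch job
# spills to the remote NPU aggregated by cross-node virtualization.
```

Real schedulers layer preemption, fairness, and topology awareness on top of such a core, but the priority-then-locality ordering captures the time-sharing idea the article describes.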