GPU Pooling
Alibaba Cloud's secret weapon debuts at a top conference: NVIDIA GPU count slashed by 82%, with 213 GPUs doing the work of 1,192
量子位 (QbitAI) · 2025-10-21 23:50
Core Viewpoint
- Alibaba Cloud has introduced Aegaeon, a GPU pooling system that cuts the number of NVIDIA GPUs needed for the same workload by 82% (213 GPUs doing the work of 1,192) by scaling resources at the token level rather than the request level [1][3].

Group 1: Research Background
- The research was conducted in collaboration with Peking University and led by Alibaba Cloud CTO Zhou Jingren [2].
- The study found that 17.7% of GPU resources were allocated to rarely used models that together accounted for only 1.35% of total request volume [4].

Group 2: Aegaeon's Innovations
- Aegaeon addresses this inefficiency with token-level auto-scaling: a GPU can switch to a different model between generated tokens instead of waiting for an entire request to complete [10][11].
- A set of optimizations, including an 80% reduction in initialization overhead and improved memory management, cuts the overhead of auto-scaling by 97% [14][15].

Group 3: Performance Outcomes
- Compared with existing systems such as ServerlessLLM and MuxServe, Aegaeon improves performance by at least 1.5 times and up to 9 times [18].
- In production deployment, Aegaeon has served 47 models of varying sizes, raising GPU utilization from 13.3%-33.9% to 48.1% with no service level objective (SLO) violations or service interruptions [20].
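
To make the difference between request-level and token-level switching in Group 2 concrete, here is a toy simulation of one GPU shared by two models. It is a minimal sketch and not Aegaeon's actual scheduler: the names `Request` and `first_token_times`, the round-robin policy, and the `token_cost` and `switch_cost` values are all hypothetical, chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    model: str   # which model must be loaded on the GPU to serve this request
    tokens: int  # number of output tokens to generate


def first_token_times(requests, token_level, token_cost=1.0, switch_cost=0.5):
    """Simulate a single GPU and return {request index: time of first token}.

    token_level=False: a request runs to completion before the GPU moves on.
    token_level=True:  the GPU yields after every token (simple round-robin).
    Both costs are made-up illustrative numbers, not measured values.
    """
    queue = deque((i, r.model, r.tokens) for i, r in enumerate(requests))
    clock, loaded, first = 0.0, None, {}
    while queue:
        idx, model, remaining = queue.popleft()
        if model != loaded:           # pay the model-switch overhead
            clock += switch_cost
            loaded = model
        steps = 1 if token_level else remaining
        for _ in range(steps):
            clock += token_cost
            first.setdefault(idx, clock)  # record only the first token
        remaining -= steps
        if remaining > 0:             # unfinished: go to the back of the queue
            queue.append((idx, model, remaining))
    return first


# A long request for model A arrives just ahead of a short one for model B.
workload = [Request("A", 100), Request("B", 2)]
print(first_token_times(workload, token_level=False))  # {0: 1.5, 1: 102.0}
print(first_token_times(workload, token_level=True))   # {0: 1.5, 1: 3.0}
```

In this toy setup the short request for model B gets its first token at time 3.0 instead of 102.0, because it no longer waits behind the long request. The price is more frequent model switches, which is why the reduction in switching overhead described above matters for this approach.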