KV Cache Acceleration of vLLM using DDN EXAScaler
DDN · 2025-11-11 16:44
AI Inference Challenges & KV Caching Solution
- Large context windows are a challenge for AI inference: they increase tokenization work and latency [1][2]
- Caching context tokens improves responsiveness, lowers latency, and allows larger amounts of context to be stored [4]
- Effective caching requires storage systems with low latency and large capacity at scale [5]

DDN's Solution & Performance
- DDN's EXAScaler platform enables high-performance KV caching for AI inference, improving user concurrency, responsiveness, and the overall user experience [7]
- DDN leverages GPUDirect Storage (GDS) for the cache engine [9]
- Caching demonstrates a 10x performance improvement with larger contexts [14]
- EXAScaler can improve time to first token during inference by 10-25x [16]
- DDN improves response times, provides a larger cache repository, and delivers cost-effective performance and capacity density [17]

Capacity Implications
- KV caching accelerates the end-user experience, putting a premium on high-performance shared storage [16]
- Approximately 200,000 input characters produced a cache of 796 files totaling almost 13 GB [15]
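
The capacity figures quoted above (about 200,000 input characters, 796 cache files, almost 13 GB) can be turned into rough per-file and per-character estimates. The sketch below is only back-of-the-envelope arithmetic: it assumes decimal gigabytes, treats 13 GB as an upper bound because the summary says "almost", and assumes the cache footprint grows linearly with input length, which the source does not state. The function names are illustrative and not part of any DDN or vLLM API.

```python
# Figures quoted in the summary; 13 GB is an upper bound ("almost 13 gigabytes").
INPUT_CHARS = 200_000
CACHE_FILES = 796
CACHE_BYTES = 13 * 10**9  # assumption: decimal gigabytes


def cache_footprint(input_chars, cache_files, cache_bytes):
    """Return (average bytes per cache file, cache bytes per input character)."""
    return cache_bytes / cache_files, cache_bytes / input_chars


def projected_cache_bytes(input_chars, bytes_per_char):
    """Extrapolate cache size, assuming it scales linearly with input length."""
    return input_chars * bytes_per_char


avg_file, per_char = cache_footprint(INPUT_CHARS, CACHE_FILES, CACHE_BYTES)
print(f"average cache file: {avg_file / 1e6:.1f} MB")
print(f"cache per input character: {per_char / 1e3:.1f} KB")

# Hypothetical workload: one million input characters.
projection = projected_cache_bytes(1_000_000, per_char)
print(f"projected cache for 1M characters: {projection / 1e9:.0f} GB")
```

Under these assumptions the quoted run works out to roughly 16 MB per cache file and at most about 65 KB of cache per input character, which illustrates why the summary stresses capacity as well as latency.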
EXAScaler Multi-Tenancy Demo
DDN · 2025-09-17 23:03
Core Functionality
- The EXAScaler data intelligence platform supports multi-tenancy by leveraging VLANs and secure data partitions [1][2]
- Client access controls prevent unauthorized data access, enhancing security [2]
- Quota-based capacity management allows flexible space allocation to tenants [2]

Technical Implementation
- The network configuration uses VLANs with paired IP addresses for intracluster networking and tenant connections [3]
- Each tenant maps to two IPs, giving multiple connections to each VLAN for high availability [3]
- Multi-tenancy is enabled via EMF settings and synced across the cluster [4]
- Clients without a registered IP on the appropriate VLAN lose access to the system because of VLAN isolation [5]

Quota Management
- Hard quotas enforce strict limits: tenants cannot exceed their allocated capacity, and the total allocated to all tenants never exceeds the cluster's capacity [7][9]
- Soft quotas overallocate capacity so that tenants share it, potentially reducing waste but requiring trust between tenants [7][10]
- A hybrid approach combines soft and hard quotas, providing leeway while preventing excessive consumption of free space [11][12]

Data Handling
- The system supports on-the-fly quota adjustments while serving data to clients [9]
- The demo creates a 10 TB test file to illustrate quota enforcement [8]
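
The difference between the hard, soft, and hybrid quota schemes described above can be sketched as a small model. This is not the EXAScaler or EMF interface: `Tenant`, `allocation_ok`, `can_write`, and the `max_overallocation` parameter are hypothetical names, all sizes are made up, and reading the hybrid scheme as "bounded overallocation" is an assumption rather than something the demo states.

```python
from dataclasses import dataclass

TB = 10**12  # decimal terabyte, used only for the illustrative numbers below


@dataclass
class Tenant:
    name: str
    quota: int     # bytes the tenant may consume
    used: int = 0  # bytes currently consumed


def allocation_ok(tenants, capacity, max_overallocation=1.0):
    """Check the sum of quotas against capacity * max_overallocation.

    1.0 models the hard scheme (quotas never exceed the cluster's capacity).
    A value above 1.0 models soft or hybrid allocation (assumption: the
    amount of overallocation is bounded by this factor).
    """
    return sum(t.quota for t in tenants) <= capacity * max_overallocation


def can_write(tenant, size, tenants, capacity):
    """A write must fit the tenant's quota and the cluster's remaining free space."""
    free = capacity - sum(t.used for t in tenants)
    return tenant.used + size <= tenant.quota and size <= free


capacity = 100 * TB
a = Tenant("tenant-a", quota=60 * TB)
b = Tenant("tenant-b", quota=60 * TB)
tenants = [a, b]

# 120 TB of quota on a 100 TB cluster: rejected as hard, accepted if overallocated.
print(allocation_ok(tenants, capacity))
print(allocation_ok(tenants, capacity, max_overallocation=1.5))

# With overallocation, a tenant can be under quota and still run out of space.
a.used, b.used = 60 * TB, 35 * TB
print(can_write(b, 10 * TB, tenants, capacity))
```

In the overallocated case the final write fails even though tenant-b is under its quota, because only 5 TB of shared space remains. That is the trust trade-off the summary attributes to soft quotas.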