Designing Resilient AI & HPC Systems: Insights from Eviden's Global Deployments
DDN · 2025-07-14 13:40
Company Overview & HPC Leadership
- Eviden, a spin-off of Atos, specializes in high-performance computing (HPC) solutions, including mainframes, processors, and high-speed networks [2]
- The company positions itself as the largest HPC provider in Europe and a significant provider in Latin America and India [2][3]
- Eviden partners with DDN and uses DDN storage as the back end for most of the systems it delivers [2]

HPC & AI Integration
- Eviden is bringing HPC to AI, exemplified by a cluster of roughly 40 DGX units deployed in Ecuador for computer vision, which processes 5 gigabytes per second of video streams from more than 18,000 cameras [4][5]
- The Ecuador system runs a 31-petabyte Lustre file system and a 21-petabyte system based on S400 and X2, chosen for their fast failover and recovery capabilities [6]

Storage Challenges & Solutions
- One challenge was the inefficient storage of small files (around 7 KB each), which led to performance problems (a common packing mitigation is sketched below) [7][8]
- Another was seismic image processing requiring roughly 900 gigabytes of data reads and writes, which was addressed with local caching (see the stage-in/stage-out sketch below) [9][10]

Large Language Models (LLMs) & System Failures
- Meta's model training, involving an 800-gigabyte model on a cluster of roughly 25,000 GPUs (about 16,000 of them in use initially), suffered frequent failures: 419 failures in 54 days, approximately one every 3 hours (the arithmetic is worked below) [11][12]
- Checkpointing is deemed a necessity because of the high failure rate of components (GPUs, interconnects, power supplies, software bugs) in large-scale systems (a minimal checkpoint/restore loop is sketched below) [13][14]

Scalable & Efficient HPC Systems
- Eviden is building its first exascale system based on its own technology and is aiming for high efficiency: one of its systems ranks first and another sixth on the Green500 list [15]
- The company is delivering approximately five exascale systems in Brazil, using DDN storage solutions sized for IOPS, bandwidth, and capacity [16][17]

Key Takeaways for Large Workloads
- Large workloads such as LLMs require large systems, and large systems are prone to component failures [17][18]
- Mitigating those failures requires a network and storage design robust enough for the number of GPUs, together with checkpointing [18][19]
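The failure numbers above follow directly from scale. As a back-of-the-envelope check (assuming independent, exponentially distributed failures, an assumption the talk does not state), the system-level MTBF shrinks linearly with component count:

```latex
% System MTBF with N independent components (assumption, not from the talk)
\[ \mathrm{MTBF}_{\mathrm{system}} \approx \frac{\mathrm{MTBF}_{\mathrm{component}}}{N} \]

% Meta's reported rate: 419 failures in 54 days
\[ \mathrm{MTBF}_{\mathrm{system}} \approx \frac{54 \times 24\ \mathrm{h}}{419} \approx 3.1\ \mathrm{h} \]

% Implied per-GPU reliability for roughly 16{,}000 GPUs in use
\[ \mathrm{MTBF}_{\mathrm{component}} \approx 3.1\ \mathrm{h} \times 16\,000
   \approx 5.0 \times 10^{4}\ \mathrm{h} \approx 5.7\ \mathrm{years} \]
```

In other words, even if each GPU fails only about once every six years, a 16,000-GPU job still sees a failure roughly every three hours, which is why the talk treats failures as a design constraint rather than an exception.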
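Given that system MTBF, a checkpoint interval can be estimated with the standard Young/Daly approximation (a textbook result, not something presented in the talk), where δ is the time to write one checkpoint and M is the system MTBF; the δ = 5 minutes below is purely an assumed figure:

```latex
\[ \tau_{\mathrm{opt}} \approx \sqrt{2\,\delta\,M}
   \approx \sqrt{2 \times \tfrac{5}{60}\ \mathrm{h} \times 3.1\ \mathrm{h}}
   \approx 0.72\ \mathrm{h} \approx 43\ \mathrm{min} \]
```

This is where the storage sizing in the takeaways comes in: faster checkpoint writes (smaller δ) shrink both the optimal interval and the total time lost to checkpointing.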
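As a concrete sketch of the checkpoint/restore pattern the talk calls a necessity (the path, interval, and model below are illustrative assumptions, not Eviden's or Meta's actual setup), a minimal PyTorch-style loop that saves atomically and resumes after a crash might look like this:

```python
import os
import torch

CKPT = "/shared/ckpt/model.pt"   # hypothetical checkpoint path on shared storage
CKPT_EVERY = 500                 # assumed interval; see the Young/Daly estimate above

os.makedirs(os.path.dirname(CKPT), exist_ok=True)

model = torch.nn.Linear(1024, 1024)                 # stand-in for the real model
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Resume from the last checkpoint if a previous run died.
start_step = 0
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    start_step = state["step"] + 1

for step in range(start_step, 100_000):
    opt.zero_grad()
    loss = model(torch.randn(32, 1024)).pow(2).mean()  # dummy loss
    loss.backward()
    opt.step()

    if step % CKPT_EVERY == 0:
        # Write to a temp file, then rename: the rename is atomic, so a crash
        # mid-write leaves the last good checkpoint intact.
        tmp = CKPT + ".tmp"
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step}, tmp)
        os.replace(tmp, CKPT)
```

The temp-file-plus-rename step matters on shared file systems: a node that dies mid-write can never leave a truncated checkpoint as the only copy.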
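Returning to the small-file challenge: the summary does not say how it was solved, but a common mitigation (an assumption here, not the talk's stated fix) is to aggregate many ~7 KB files into one large container, so the parallel file system serves a few large sequential streams instead of millions of metadata-heavy operations. A minimal sketch with Python's standard tarfile module, with hypothetical paths:

```python
import tarfile
from pathlib import Path

SRC = Path("/data/frames")        # hypothetical directory of ~7 KB files
DST = Path("/lustre/packed.tar")  # one large container on the parallel FS

# Pack: one large sequential write instead of millions of tiny creates.
with tarfile.open(DST, "w") as tar:
    for f in sorted(SRC.glob("*.bin")):
        tar.add(f, arcname=f.name)

# Read a single member back without unpacking the whole archive.
with tarfile.open(DST, "r") as tar:
    member = tar.getnames()[0]            # e.g. the first packed file
    data = tar.extractfile(member).read()
```

Formats such as HDF5 or TFRecord serve the same purpose with indexed access.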
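The local-caching fix for the 900-gigabyte seismic workload can likewise be sketched as a stage-in/stage-out pattern (paths and the processing step are hypothetical): copy the input from shared storage to node-local NVMe once, do all repeated reads locally, and write results back at the end.

```python
import shutil
from pathlib import Path

SHARED_IN = Path("/lustre/seismic/survey.dat")   # hypothetical shared-FS input
SHARED_OUT = Path("/lustre/seismic/survey.out")  # hypothetical result location
SCRATCH = Path("/nvme/cache")                    # assumed node-local NVMe

def process(path: Path) -> bytes:
    # Stand-in for the real seismic kernel, which re-reads its input many
    # times; after staging, those reads hit local flash, not the network.
    return path.read_bytes()[:64]

# Stage in: one bulk copy from shared storage to the local cache.
SCRATCH.mkdir(parents=True, exist_ok=True)
local = SCRATCH / SHARED_IN.name
if not local.exists():
    shutil.copy2(SHARED_IN, local)

result = process(local)

# Stage out: one sequential write of the result back to shared storage.
SHARED_OUT.write_bytes(result)
```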