High Performance Computing

Designing Resilient AI & HPC Systems: Insights from Eviden's Global Deployments
DDN · 2025-07-14 13:40
Company Overview & HPC Leadership
- Eviden, a spin-off of Atos, specializes in high-performance computing (HPC) solutions, including mainframes, processors, and high-speed networks [2]
- The company positions itself as the largest HPC provider in Europe and a significant provider in Latin America and India [2][3]
- Eviden partners with DDN and uses DDN storage as the back end for most of the systems it delivers [2]

HPC & AI Integration
- Eviden is bringing HPC to AI, exemplified by a cluster of approximately 40 DGX units deployed in Ecuador for computer vision, processing 5 gigabytes per second of video streamed from over 18,000 cameras [4][5]
- The Ecuador system uses a 31 petabyte Lustre file system and a 21 petabyte system based on S400 and X2, chosen for its fast failover and recovery capabilities [6]

Storage Challenges & Solutions
- The presentation highlights storage challenges, including inefficient storage of small files (7 KB) that led to performance issues [7][8]
- Another challenge involved seismic image processing that required 900 gigabytes of data reads and writes, which was addressed with local caching [9][10]

Large Language Models (LLMs) & System Failures
- Meta's model training, involving an 800 gigabyte model and a cluster of 25,000 GPUs (using only 16 initially), faced frequent failures (419 failures in 45 days, approximately one failure every 3 hours) [11][12]
- Checkpointing is deemed a necessity because components in large-scale systems (GPUs, interconnects, power supplies) and software fail at a high rate [13][14]

Scalable & Efficient HPC Systems
- Eviden is building its first exascale system based on its own technology, aiming for high efficiency; one model ranks number one and another ranks sixth on the Green500 list [15]
- The company is delivering approximately five exascale systems in Brazil, using DDN storage solutions focused on IOPS, bandwidth, and capacity [16][17]

Key Takeaways for Large Workloads
- Large workloads like LLMs require large systems, which are prone to component failures [17][18]
- Mitigating failures requires a robust network and storage design matched to the number of GPUs, together with the use of checkpointing [18][19]
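The failure figures quoted above (419 failures in 45 days) can be turned into a rough checkpointing budget. The sketch below is illustrative only: it computes the mean time between failures from the summary's numbers and then applies Young's first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTBF). The 5-minute checkpoint cost and the function names are assumptions made for this example and do not come from the presentation.

```python
import math


def mean_time_between_failures_hours(failures: int, days: float) -> float:
    """Average number of hours between failures over an observation window."""
    return days * 24.0 / failures


def young_checkpoint_interval_hours(checkpoint_cost_hours: float,
                                    mtbf_hours: float) -> float:
    """Young's first-order approximation of the checkpoint interval:
    sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


# Figures quoted in the summary: 419 failures in 45 days.
mtbf = mean_time_between_failures_hours(failures=419, days=45)

# Assumption (not from the source): writing one checkpoint takes 5 minutes.
checkpoint_cost = 5.0 / 60.0
interval = young_checkpoint_interval_hours(checkpoint_cost, mtbf)

print(f"Mean time between failures: {mtbf:.2f} h")
print(f"Suggested checkpoint interval: {interval * 60:.0f} min")
```

With the quoted figures the mean time between failures comes out at about 2.6 hours, in line with the summary's "approximately one failure every 3 hours". Under the assumed 5-minute checkpoint cost, the approximation suggests checkpointing roughly every 39 minutes. A faster storage back end lowers the checkpoint cost and therefore allows more frequent checkpoints for the same overhead.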
Gryphon Digital Mining(GRYP) - 2024 Q4 - Earnings Call Transcript
2025-03-31 22:48
Gryphon Digital Mining, Inc. (NASDAQ:GRYP) Q4 2024 Earnings Conference Call March 31, 2025 4:30 PM ET Company Participants Steve Gutterman - Chief Executive Officer Sim Salzman - Chief Financial Officer Conference Call Participants Kevin Dede - H.C. Wainwright Jon Hickman - Ladenburg Thalmann Operator Greetings and welcome to the Gryphon Digital Mining Fourth Quarter and Full Year 2024 Earnings Call. On the call are Steve Gutterman, Chief Executive Officer of the company; and Sim Salzman, Chief Financial Of ...