Workflow
Checkpointing
Achieving Success for HPC and AI-Driven Business Outcomes - Paul Bloch, DDN
DDN· 2025-09-18 15:10
DDN's Market Positioning & Strategy
- DDN is recognized as a key player in high-performance computing, particularly by Nvidia, which has used DDN exclusively for the past eight years [1]
- DDN's technology is integral to Nvidia's testing and development, including platforms such as Selene (A100), Eos (H100, 4,000 GPUs), and GB200 [1]
- DDN emphasizes its ability to scale solutions from small implementations (2U) to massive deployments (100,000+ GPUs), validated at 100% [1]
- DDN reinvests customer dollars into R&D, engineering talent, and feature development for both EXAScaler and Infinia [2]

Technological Advantages & Solutions
- DDN's solutions offer better GPU efficiency in checkpointing, data loading, and data crunching, with significantly faster write performance than competitors [2]
- DDN's architecture simplifies deployments by using fewer network ports, which improves stability and scalability and avoids the full-mesh requirements of competing solutions [1]
- DDN provides online upgrades and cluster-level visibility into workloads and potential issues, extending beyond storage to network and GPU monitoring [2]
- DDN's systems are fully balanced, so performance scales linearly with added units, aggregating performance and access as the system expands [2]

Customer Success & Partnerships
- Jump Trading, a high-frequency trading firm, deployed half an exabyte of DDN's platform after switching from competing technologies [2]
- DDN is partnering with Nvidia cloud providers (NCPs) to deliver AI in the cloud as a private cloud solution, offering control over data and latency [2]
- Scaleway, an NCP, has found that DDN maintains consistent performance at scale, without issues related to metadata or object size limitations [2]

Addressing Industry Trends
- The pace of technology change is accelerating, with new chips emerging every six months to a year, which requires faster time to data, resolution, and production [1]
- The scale of deployments is increasing rapidly, with discussions now commonly involving 100,000 to 500,000 GPUs and the infrastructure needed to support them [1]
- Customers demand rapid deployment, expecting systems to be up and running within 60 days or less, which emphasizes the need for quick time to results [1]
Designing Resilient AI & HPC Systems: Insights from Eviden's Global Deployments
DDN· 2025-07-14 13:40
Company Overview & HPC Leadership
- Eviden, a spin-off of Atos, specializes in high-performance computing (HPC) solutions, including mainframes, processors, and high-speed networks [2]
- The company positions itself as the largest HPC provider in Europe and a significant provider in Latin America and India [2][3]
- Eviden partners with DDN, using DDN as the back end for most of the systems it delivers [2]

HPC & AI Integration
- Eviden is bringing HPC to AI, exemplified by a cluster of approximately 40 DGX units implemented in Ecuador for computer vision, processing 5 gigabytes per second of video stream from over 18,000 cameras [4][5]
- The Ecuador system uses a 31 petabyte Lustre file system and a 21 petabyte system based on S400 and X2, chosen for its fast failover and recovery capabilities [6]

Storage Challenges & Solutions
- The presentation highlights storage challenges, including inefficient storage of small files (7 KB) that led to performance issues [7][8]
- Another challenge involved seismic image processing requiring 900 gigabytes of data reads and writes, which was addressed using local caching [9][10]

Large Language Models (LLMs) & System Failures
- Meta's model training, involving an 800 gigabyte model and a cluster of 25,000 GPUs (using only 16 initially), faced frequent failures (419 failures in 45 days, approximately one failure every 3 hours) [11][12]
- Checkpointing is deemed a necessity because of the high failure rate of components (GPUs, interconnects, power supplies, software bugs) in large-scale systems [13][14]

Scalable & Efficient HPC Systems
- Eviden is building its first exascale system based on its own technology, aiming for high efficiency, with one model ranking number one and another in position six on the Green500 list [15]
- The company is delivering approximately five exascale systems in Brazil, using DDN storage solutions focused on IOPS, bandwidth, and space [16][17]

Key Takeaways for Large Workloads
- Large workloads like LLMs require large systems, which are prone to component failures [17][18]
- Mitigating failures requires a robust network and storage design sized to the number of GPUs, together with checkpointing [18][19]
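
The failure figures quoted above (419 failures in 45 days, an 800 gigabyte model) can be turned into a rough checkpoint cadence. The sketch below is illustrative only and is not taken from the presentation: it assumes the checkpoint is the same size as the model, uses a hypothetical sustained write bandwidth of 100 GB/s, and applies Young's first-order approximation, interval = sqrt(2 x checkpoint write time x mean time between failures). The function names are invented for this example. With the quoted figures, the mean time between failures works out to about 2.6 hours, which the summary rounds to roughly 3 hours.

```python
import math


def mean_time_between_failures(failures: int, days: float) -> float:
    """Average number of seconds between failures over the observation window."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return days * 24 * 3600 / failures


def checkpoint_write_seconds(checkpoint_gb: float, write_gb_per_s: float) -> float:
    """Time to write one checkpoint at a sustained write bandwidth."""
    if write_gb_per_s <= 0:
        raise ValueError("write bandwidth must be positive")
    return checkpoint_gb / write_gb_per_s


def young_interval_seconds(write_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2 * write_s * mtbf_s)


if __name__ == "__main__":
    # Figures quoted in the summary: 419 failures in 45 days.
    mtbf_s = mean_time_between_failures(failures=419, days=45)

    # Assumptions, not from the source: checkpoint size equals the
    # 800 GB model size, and storage sustains 100 GB/s of writes.
    write_s = checkpoint_write_seconds(checkpoint_gb=800, write_gb_per_s=100)

    interval_s = young_interval_seconds(write_s, mtbf_s)

    print(f"MTBF: {mtbf_s / 3600:.2f} h")
    print(f"Checkpoint write time: {write_s:.1f} s")
    print(f"Suggested checkpoint interval: {interval_s / 60:.1f} min")
```

Under these assumptions, a faster write path shortens the checkpoint write time, which in turn lets the job checkpoint more often and lose less work per failure. The 100 GB/s figure is a placeholder; substitute a measured bandwidth and the real checkpoint size (which often includes optimizer state) before relying on the result.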