Designing Resilient AI & HPC Systems: Insights from Eviden's Global Deployments
DDN· 2025-07-14 13:40
Company Overview & HPC Leadership
- Eviden, a spin-off of Atos, specializes in high-performance computing (HPC) solutions, including mainframes, processors, and high-speed networks [2]
- The company positions itself as the largest HPC provider in Europe and a significant provider in Latin America and India [2][3]
- Eviden partners with DDN and uses DDN storage as the back end for most of the systems it delivers [2]

HPC & AI Integration
- Eviden is bringing HPC to AI, exemplified by a cluster of approximately 40 DGX units deployed in Ecuador for computer vision, processing 5 gigabytes per second of video streams from more than 18,000 cameras [4][5]
- The Ecuador system uses a 31 petabyte Lustre file system and a 21 petabyte system based on S400 and X2, chosen for its fast failover and recovery capabilities [6]

Storage Challenges & Solutions
- The presentation highlights storage challenges, including inefficient storage of small files (7 KB) that led to performance issues [7][8]
- Another challenge involved seismic image processing requiring 900 gigabytes of data reads and writes, which was addressed using local caching [9][10]

Large Language Models (LLMs) & System Failures
- Meta's model training, involving an 800 gigabyte model and a cluster of 25,000 GPUs (using only 16 initially), faced frequent failures (419 failures in 45 days, approximately one failure every 3 hours) [11][12]
- Checkpointing is deemed a necessity because of the high failure rate in large-scale systems (GPUs, interconnects, power supplies, software bugs) [13][14]

Scalable & Efficient HPC Systems
- Eviden is building its first exascale system based on its own technology, aiming for high efficiency, with one model ranking number one and another in position six on the Green500 list [15]
- The company is delivering approximately five exascale systems in Brazil, using DDN storage solutions focused on IOPS, bandwidth, and capacity [16][17]

Key Takeaways for Large Workloads
- Large workloads like LLMs require large systems, which are prone to component failures [17][18]
- Mitigating failures requires a robust network and storage design sized to the number of GPUs, together with checkpointing [18][19]
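The failure figures quoted above can be turned into a rough checkpoint cadence. The sketch below is illustrative only: it computes the mean time between failures from the quoted 419 failures in 45 days (about 2.6 hours, in line with the "roughly every 3 hours" figure), then applies Young's first-order approximation for the checkpoint interval. The write bandwidth `ASSUMED_WRITE_BW_GBPS` is a hypothetical value not given in the source, and the checkpoint cost ignores optimizer state and other overheads.

```python
import math

# Figures quoted in the summary above.
FAILURES = 419
WINDOW_DAYS = 45
MODEL_SIZE_GB = 800

# Hypothetical assumption, not from the source: sustained checkpoint
# write bandwidth available to the job, in GB/s.
ASSUMED_WRITE_BW_GBPS = 100.0


def mean_time_between_failures_hours(failures: int, days: float) -> float:
    """Average hours between failures over the observation window."""
    return days * 24.0 / failures


def checkpoint_cost_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Time to write one checkpoint, ignoring optimizer state and overheads."""
    return size_gb / bandwidth_gbps


def young_interval_seconds(cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * cost_s * mtbf_s)


mtbf_h = mean_time_between_failures_hours(FAILURES, WINDOW_DAYS)
cost_s = checkpoint_cost_seconds(MODEL_SIZE_GB, ASSUMED_WRITE_BW_GBPS)
interval_s = young_interval_seconds(cost_s, mtbf_h * 3600.0)

print(f"MTBF: {mtbf_h:.2f} h")
print(f"Checkpoint cost: {cost_s:.1f} s")
print(f"Suggested interval: {interval_s / 60.0:.1f} min")
```

Under these assumptions, slower storage raises the checkpoint cost and lengthens the suggested interval, so more work is lost per failure. This is one way to read the summary's point that storage must be sized to the number of GPUs.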
Computing Power Demand Remains Strong! Watch for Opportunities in Semiconductor Chips
Mei Ri Jing Ji Xin Wen· 2025-07-09 01:20
Group 1
- The semiconductor chip sector is performing strongly as cloud service providers increase their investment in artificial intelligence (AI), creating a favorable supply-demand environment for computing power and storage [1][3]
- Industrial Fulian announced an expected net profit of 6.727 billion to 6.927 billion yuan for Q2 2025, a year-on-year increase of 47.72% to 52.11%, driven by rapid growth in its cloud computing business [3]
- Major cloud providers such as Microsoft and Google are projected to increase capital expenditures by over 30% year-on-year in 2025, while the capital expenditures of domestic companies Alibaba and Tencent are expected to exceed 120 billion yuan and 80 billion yuan, respectively [3]

Group 2
- Demand for storage products is rising as AI servers, PCs, and mobile phones move to larger storage capacities, with multiple storage manufacturers reporting significant order increases [5]
- DRAM contract prices have risen sharply since May, and the upward trend is expected to continue into Q3/Q4 due to supply-demand dynamics [5]
- Several A-share semiconductor companies are planning to list in Hong Kong, which may improve their access to international capital markets and help them expand overseas production and sales [8]

Group 3
- The communication ETF (515880) rose 4.65% on July 8, reflecting positive market sentiment toward AI and semiconductor-related stocks [2][3]
- Industrial Fulian's AI server revenue increased by over 60% year-on-year, while its revenue from cloud service provider servers surged by more than 150% [3]
- Changxin Storage, the largest domestic DRAM manufacturer, has initiated listing guidance, aiming to raise funds to expand production capacity and raise the share of domestically produced DRAM [5]
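As a quick consistency check on the Industrial Fulian guidance quoted above, the profit range and the growth range should imply the same prior-year figure. The sketch below does only that arithmetic, and the helper name `implied_base` is illustrative. Both ends of the range work out to about 4.55 billion yuan, so the quoted numbers are internally consistent.

```python
# Figures quoted in the summary above (billions of yuan, percent).
profit_low, profit_high = 6.727, 6.927
growth_low, growth_high = 47.72, 52.11


def implied_base(current: float, growth_pct: float) -> float:
    """Prior-year value implied by a current value and its year-on-year growth."""
    return current / (1.0 + growth_pct / 100.0)


base_from_low = implied_base(profit_low, growth_low)
base_from_high = implied_base(profit_high, growth_high)

print(f"implied base (low end):  {base_from_low:.3f} billion yuan")
print(f"implied base (high end): {base_from_high:.3f} billion yuan")
```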
Accelerating AI Storage with NVIDIA SpectrumX & DDN
DDN· 2025-06-20 11:24
AI Infrastructure & Performance
- Slow storage significantly hinders AI training and inference speeds, reducing GPU utilization [2]
- Fast storage is crucial for various AI applications and other accelerated workloads [3]
- Spectrum-X technology, initially designed for GPU-to-GPU communication, is now being adapted to accelerate storage traffic [4][5]
- Spectrum-X improves GPU storage bandwidth by approximately 50% and enhances performance in noisy environments [6]

Technical Innovations & Solutions
- Traditional Ethernet struggles with large data flows ("elephant flows") because flow-by-flow load balancing leads to ECMP collisions [7][8]
- Spectrum-X employs packet-by-packet load balancing to achieve optimal fabric utilization; this requires a full-stack solution, with technology in storage appliances, GPU servers, and switches, to handle out-of-order packets [8][9]
- Spectrum-X addresses incast congestion that arises when multiple GPUs send data to storage or vice versa [10][11]
- The technology mitigates performance degradation caused by link failures in large-scale deployments [12][13][14]

Testing & Validation
- NVIDIA uses its supercomputer, Israel-1, as a proving ground for Spectrum-X development and testing, including storage applications [18][19]
- Tests on Israel-1, involving 300 GPUs across four scalable units, demonstrated that Spectrum-X accelerates write performance by nearly 50% compared to RoCE [20][21][23]
- DDN validated Spectrum-X with its full stack, publishing a white paper and technical blog on the results [24]

Visibility & Management
- Spectrum-X provides enhanced visibility into the entire fabric, enabling partners like DDN to monitor and predict potential issues using APIs [17]
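The contrast between flow-by-flow and packet-by-packet load balancing described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the Spectrum-X algorithm. The link and flow counts are hypothetical, a random choice stands in for the ECMP hash of a flow's 5-tuple, and simple round-robin stands in for per-packet spraying. The model ignores packet reordering, which the summary notes must be handled by the endpoints.

```python
import random
from collections import Counter

NUM_LINKS = 4            # hypothetical number of equal-cost uplinks
NUM_FLOWS = 4            # a few large "elephant" flows
PACKETS_PER_FLOW = 1000  # hypothetical flow size


def per_flow_load(seed: int) -> Counter:
    """Flow-by-flow balancing: every packet of a flow follows one hashed link."""
    rng = random.Random(seed)
    load = Counter({link: 0 for link in range(NUM_LINKS)})
    for _ in range(NUM_FLOWS):
        link = rng.randrange(NUM_LINKS)  # stands in for a hash of the flow 5-tuple
        load[link] += PACKETS_PER_FLOW
    return load


def per_packet_load() -> Counter:
    """Packet-by-packet balancing: packets are sprayed round-robin across links."""
    load = Counter({link: 0 for link in range(NUM_LINKS)})
    for packet in range(NUM_FLOWS * PACKETS_PER_FLOW):
        load[packet % NUM_LINKS] += 1
    return load


def imbalance(load: Counter) -> float:
    """Busiest link's load relative to a perfectly even split (1.0 is ideal)."""
    ideal = sum(load.values()) / NUM_LINKS
    return max(load.values()) / ideal


flow_results = [imbalance(per_flow_load(seed)) for seed in range(100)]
packet_result = imbalance(per_packet_load())

mean_flow = sum(flow_results) / len(flow_results)
print(f"per-flow, mean imbalance over 100 trials: {mean_flow:.2f}")
print(f"per-packet imbalance: {packet_result:.2f}")
```

With only a handful of large flows, two of them often hash to the same link. That link becomes the bottleneck while others sit idle, which is the collision effect the summary attributes to traditional Ethernet.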
ONEOK (OKE) Fireside Chat Transcript
2025-05-28 19:30
Summary of ONEOK (OKE) Fireside Chat - May 28, 2025

Company Overview
- **Company**: ONEOK (OKE)
- **Industry**: Energy and Natural Gas

Key Points and Arguments

Commodity Price Impact
- ONEOK is less affected by commodity price fluctuations compared to other companies, which can see cash flow changes of up to 40% with a $1 change in gas or a $10 change in oil prices [4][6]
- Recent oil price declines have not led to a decrease in volume for ONEOK, indicating stability in their operations [4][6]

Natural Gas Outlook
- Natural gas production in the U.S. has increased from approximately 20 trillion cubic feet (Tcf) in 2000-2007 to around 42 Tcf in 2024, driven by coal-to-gas conversions and LNG exports [8][9]
- Current LNG export facilities are projected to increase capacity to 10 billion cubic feet (Bcf) per day, with potential future expansions [10]
- The demand for natural gas is expected to grow due to factors such as artificial intelligence data centers and ongoing coal plant conversions [11]

Natural Gas Liquids (NGLs) and Petrochemicals
- NGLs are primarily byproducts of crude oil and natural gas production, and their value is dependent on transportation to markets where they can be sold at higher prices [12][13]
- The U.S. is expected to remain a significant supplier of ethane to petrochemical companies, with a notable portion being exported to China [18][20]

Regulatory Environment
- The current administration is actively seeking specific feedback from the industry to improve regulatory processes, which is seen as a positive change [22]
- Tariff policies are viewed as volatile, but there is a growing understanding that they may not be permanent [23][24]

Strategic Focus and Growth
- ONEOK's strategy emphasizes brownfield expansions to reduce capital costs and enhance integration within their existing systems [26][27]
- The company has divested non-integrated assets to focus on core business areas, which has allowed for better capital allocation [28]

Financial Strategy
- ONEOK aims for a balanced capital allocation strategy, prioritizing organic growth projects while maintaining a strong dividend policy [75][76]
- The company targets a debt-to-EBITDA ratio of around 3.5 times, with plans for stock buybacks if excess cash is generated [78][79]

Storage and Volatility Management
- Storage capacity is seen as a critical component for managing the volatility of natural gas pricing, especially with increasing LNG exports [37][44]
- ONEOK is expanding its storage capabilities, which are expected to provide opportunities in a volatile market [47][59]

Customer Diversification
- The company is shifting from a supply-push model to a demand-pull model, diversifying its customer base and reducing reliance on specific markets [51][52]

Long-term Outlook
- The Bakken region is expected to sustain production levels for decades, supported by advancements in drilling technology [54]
- ONEOK anticipates continued growth in its core business, driven by synergies from recent acquisitions and ongoing demand for natural gas and NGLs [83][84]

Additional Important Insights
- The integration of various assets is a key focus for ONEOK, as it allows for better control over revenue streams and operational efficiencies [91][94]
- The company is positioned well in the LNG market, with the U.S. being a significant player in global exports, particularly to Asia [71][72]

This summary captures the essential insights from the ONEOK Fireside Chat, highlighting the company's strategic focus, market outlook, and financial strategies.