GPU Clusters

What are Scale Up and Scale Out?
半导体行业观察· 2025-05-23 01:21
Core Viewpoint
- The article discusses the concepts of horizontal and vertical scaling in GPU clusters, particularly in the context of AI Pods, which are modular infrastructure solutions designed to streamline AI workload deployment [2][4].

Group 1: AI Pod and Scaling Concepts
- AI Pods integrate computing, storage, networking, and software components into a cohesive unit for efficient AI operations [2].
- Vertical scaling (Scale-Up) adds more resources (such as processors and memory) to a single AI Pod, while horizontal scaling (Scale-Out) adds more AI Pods and connects them together [4][8].
- XPU is a general term for any type of processing unit, covering architectures such as CPUs, GPUs, and ASICs [6][5].

Group 2: Advantages and Disadvantages of Scaling
- Vertical scaling is straightforward and leverages powerful server hardware, making it suitable for applications with high memory or processing demands [9][8].
- However, vertical scaling is bounded by physical hardware constraints, which can create performance bottlenecks [8].
- Horizontal scaling offers long-term scalability and flexibility, and allows capacity to be scaled back easily when demand decreases [12][13].

Group 3: Communication and Networking
- Communication within and between AI Pods is crucial; pod-to-pod communication typically requires low latency and high bandwidth [11].
- InfiniBand and Ultra Ethernet are the key competitors for inter-pod and data center fabrics, with InfiniBand being the long-standing standard for low-latency, high-bandwidth communication [13].
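The trade-off between the two strategies can be sketched in a few lines of code. This is an illustrative toy model only, not a benchmark of any real AI Pod: the XPU counts, TFLOPS figures, and the flat inter-pod communication overhead are all hypothetical assumptions chosen to show why scale-out pays a networking tax that scale-up does not.

```python
# Toy model of the scale-up vs. scale-out trade-off described above.
# All numbers are hypothetical, not measurements of real hardware.

def scale_up_throughput(xpus_per_pod: int, per_xpu_tflops: float) -> float:
    """Vertical scaling: XPUs inside one pod share a fast local fabric,
    so aggregate throughput is modeled as roughly additive."""
    return xpus_per_pod * per_xpu_tflops

def scale_out_throughput(pods: int, pod_tflops: float,
                         comm_overhead: float) -> float:
    """Horizontal scaling: pods communicate over an inter-pod network
    (e.g. InfiniBand or Ultra Ethernet), modeled here as a flat
    fractional overhead on aggregate throughput."""
    return pods * pod_tflops * (1.0 - comm_overhead)

# One pod with 8 XPUs at a hypothetical 100 TFLOPS each:
single_pod = scale_up_throughput(8, 100.0)        # 800.0 TFLOPS
# Four such pods with an assumed 10% communication overhead:
cluster = scale_out_throughput(4, single_pod, 0.1)  # 2880.0 TFLOPS
```

The sketch makes the article's point concrete: scale-up is limited by how many XPUs fit in one pod, while scale-out grows without that ceiling but never reaches the ideal linear speedup because of the inter-pod communication penalty.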
Baidu (09888.HK) announced that it has successfully built a GPU cluster composed of 30,000 self-developed Kunlun chips, sufficient to support the training of large language models.
news flash· 2025-04-25 03:07
Core Viewpoint
- Baidu has successfully established a GPU cluster composed of 30,000 self-developed Kunlun chips, sufficient to support the training of large language models [1].

Company Summary
- The GPU cluster consists of 30,000 Kunlun chips, indicating Baidu's significant investment in AI infrastructure [1].
- This development positions Baidu to enhance its capabilities in training large language models, which is crucial for advancing its AI initiatives [1].

Industry Summary
- The establishment of such a large GPU cluster reflects the growing demand for advanced computing power in the AI industry [1].
- Companies in the AI sector are increasingly focusing on developing proprietary hardware to support their machine learning and AI model training needs [1].