HAMi × NVIDIA: A Detailed Look at GPU Topology-Aware Scheduling
AI前线· 2025-10-25 05:32
Core Insights
- HAMi is an active open-source project maintained by over 350 contributors from more than 15 countries and adopted by over 200 enterprises and institutions, reflecting the breadth of its community and ecosystem support [2]
- The introduction of topology-aware scheduling for NVIDIA GPUs in version v2.7.0 addresses communication bottlenecks in high-performance computing (HPC) and AI model training scenarios, optimizing task placement to improve overall computational efficiency [2][3]

Feature Overview
- The core design of HAMi's topology-aware scheduling quantifies the physical topology into "communication scores" between devices, allowing the scheduler to make optimal placement decisions based on these scores [5]
- Topology scores are calculated dynamically: the Device Plugin uses NVML to detect the physical links between GPUs, providing the basis for scheduling decisions [6]
- The scheduling process consists of two phases: topology registration, which quantifies physical connections into scores the scheduler can understand, and scheduling decision-making, which selects the optimal devices based on those scores [9][10]

Implementation Details
- The discovery and quantification of topology information underpin the subsequent intelligent decision-making, generating a score table that is reported to the scheduler [13]
- The Fit function implements a dual-strategy optimization algorithm, automatically applying a "best match" strategy for multi-GPU tasks and a "minimal disruption" strategy for single-GPU tasks to preserve the long-term health of the cluster's topology resources [6][22]

Usage
- Users can enable topology-aware scheduling with a single annotation; the scheduler then automatically applies the appropriate strategy based on the number of GPUs requested, as illustrated in the sketch following this summary [25][26]
- The design philosophy emphasizes dynamic discovery over static configuration and foresighted decision-making over short-sighted allocation, providing a robust GPU scheduling solution for large-scale AI training and HPC tasks in cloud-native environments [27]
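To make the dual strategy concrete, here is a minimal Python sketch of the idea described above. It is illustrative only: HAMi itself is written in Go, and the names used here (LINK_SCORES, group_score, connectivity, pick_gpus) are hypothetical rather than HAMi's actual API; the score values are invented placeholders for "NVLink pair" versus "cross-PCIe" links.

```python
# Illustrative sketch, not HAMi's implementation: a pairwise "communication
# score" table plus the dual strategy described above -- "best match" for
# multi-GPU requests, "minimal disruption" for single-GPU requests.
from itertools import combinations

# Hypothetical score table as a device plugin might report it
# (higher = faster link, e.g. an NVLink pair vs. a cross-PCIe connection).
LINK_SCORES = {
    frozenset({0, 1}): 100, frozenset({2, 3}): 100,
    frozenset({0, 2}): 40,  frozenset({0, 3}): 40,
    frozenset({1, 2}): 40,  frozenset({1, 3}): 40,
}

def group_score(gpus):
    """Sum of pairwise communication scores inside a candidate GPU group."""
    return sum(LINK_SCORES[frozenset(pair)] for pair in combinations(gpus, 2))

def connectivity(gpu, free):
    """How strongly one GPU is linked to the other currently free GPUs."""
    return sum(LINK_SCORES[frozenset({gpu, other})] for other in free if other != gpu)

def pick_gpus(free, requested):
    if requested > 1:
        # Best match: the most tightly connected group of the requested size.
        return max(combinations(free, requested), key=group_score)
    # Minimal disruption: hand out the free GPU least connected to the rest,
    # keeping well-connected groups intact for future multi-GPU jobs.
    return (min(free, key=lambda g: connectivity(g, free)),)

if __name__ == "__main__":
    print(pick_gpus([0, 1, 2, 3], 2))  # -> (0, 1): an NVLink-connected pair
    print(pick_gpus([0, 1, 2], 1))     # -> (2,): GPU 3 is busy, so GPU 2 is
                                       #    handed out to keep the 0-1 pair intact
```

In the real project, per the article, a single pod annotation is enough to request this behavior, and the scheduler picks between the two strategies based on the number of GPUs requested.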
China's Chip Technology Achieves Multiple Breakthroughs
Xin Lang Cai Jing· 2025-10-18 13:27
Core Progress in China's Chip Technology
- China's chip technology has achieved multiple breakthroughs, marking a shift from "single-point breakthroughs" to "systematic innovation" in the domestic semiconductor industry [1]

Disruptive Computing Chips: Breaking Physical Barriers
- The world's first 24-bit precision analog matrix chip, developed by Peking University, raises the precision of traditional analog computing from 8 bits to 24 bits with an error rate below 0.1% [1]
- The chip achieves a computational throughput more than 1,000 times that of top GPUs when solving 128×128 matrix equations, with energy efficiency improved by more than 100 times [2]
- By overcoming the long-standing problems of low precision and poor scalability in analog computing, it opens new pathways for AI large model training and edge computing [3]

Integrated Storage and Computing Chips
- Tsinghua University has developed the world's first memristor chip integrating storage, computing, and on-chip learning, achieving a 75-fold energy-efficiency improvement over traditional ASICs [4]
- The chip supports AI training directly on hardware, reducing reliance on cloud services [4]

Core Processes and Materials: Breaking Monopolies
- Guoguang Liangzuo has launched a 1nm ion beam etching machine with a precision of 0.02 nanometers, outperforming mainstream 2nm-class equipment by a factor of 100 [7]
- Shanghai Microelectronics has achieved mass production of immersion lithography machines, with the domestic equipment matching rate exceeding 50% [7]
- Fudan University has developed the world's first two-dimensional/silicon-based hybrid architecture flash memory chip, with read and write speeds a million times faster than traditional flash memory [7]

High-End Chip Design and Manufacturing: Entering the First Tier
- Xiaomi has launched the first self-developed 3nm mobile SoC in mainland China, integrating 19 billion transistors and delivering performance close to Apple's A18 Pro with a 30% energy-efficiency improvement [8]
- Huawei's Ascend 910B supports 8-card interconnection, reducing dependence on imported AI computing power from 95% to 50% [9]
- The Loongson 3C6000 chip, built on a fully autonomous architecture, surpasses Intel's Xeon 8380 in performance and has received the highest national security certification [10]

Future Directions and Challenges
- A joint project between Peking University and City University of Hong Kong has produced a full-band 6G chip running at 120Gbps and supporting integrated networking [11]
- China Telecom's 504-qubit superconducting quantum computer "Tianyan 504" is expected to improve quantum chip yield [12]
- The industry still depends on EUV lithography machines for processes below 7nm, with domestic EUV equipment expected by 2027 [13]
- Development of GPU toolchains and EDA design software needs to accelerate to strengthen the software ecosystem [14]

Summary
- China's chip technology is advancing in leaps through multi-path innovation: short-term goals focus on a fully autonomous 28nm supply chain, mid-term goals on reshaping computing power with new architectures, and long-term goals on seizing the high ground in quantum chips and two-dimensional materials [14][15]
The Next "Cambricon King" Is About to Emerge: Computing Power and Robotics Resonate in a High-Potential NVIDIA Core Partner Stock
Xin Lang Cai Jing· 2025-10-08 04:16
Group 1
- The report "Global Digital Intelligence Index 2025" predicts that society's total computing power will grow 100,000-fold by 2035, a forecast drawing significant attention across the tech and finance sectors [1]
- Computing power is regarded as the core productivity of the AI era: China's intelligent computing power is expected to reach 1,037.3 EFLOPS in 2025, a 43% increase over 2024, and 1,460.3 EFLOPS in 2026, roughly double the 2024 level [2]
- Major economies treat computing power as a strategic resource: the US is investing $52 billion in the semiconductor industry through the CHIPS and Science Act, and the EU has launched the European Chips Act aiming for 20% of the global market share by 2030 [2]

Group 2
- Demand for computing power is growing exponentially across multiple fields, including AI model training, autonomous driving, smart cities, industrial robotics, and military applications [4]
- In the context of Industry 4.0, the real-time computing requirements of smart manufacturing continue to rise [5]

Group 3
- Unisoc is a leading company in the computing power sector; its subsidiary Unisoc Xiaotong is the general agent for NVIDIA's enterprise products, providing a full-stack solution spanning computing, networking, storage, security, backup, and AI software [6]
- Invid is another key player, supplying liquid cooling systems for IDC data centers, with clients including Huawei and NVIDIA [6]
- Industrial Fulian, a core supplier to NVIDIA, has seen rapid growth in its AI server product line, with the NVIDIA GB200 series reaching mass production [7]
- Fenghuo Communication, through its subsidiary Changjiang Computing, collaborates with Ascend to provide computing infrastructure solutions, supplying products to Huawei [8]
- A notable emerging company in robotics has developed inspection and cleaning robots that automate hazardous operations and is the exclusive supplier of liquid cooling systems for Huawei's Ascend 910D chip [9]
WeChat-YATT Arrives: Where Is Tencent's Reinforcement Learning Strategy Headed?
Sou Hu Cai Jing· 2025-09-24 09:56
Core Insights
- Tencent's open-sourcing of the WeChat-YATT training library marks a strategic move in the competitive landscape of AI model training, particularly as OpenAI's GPT-5 approaches release [1][2]
- WeChat-YATT is designed with a focus on reinforcement learning and multimodal models, differentiating itself from mainstream frameworks such as TensorFlow and PyTorch [2]

Group 1: WeChat-YATT's Innovations
- WeChat-YATT claims breakthroughs in three areas: optimized parameter-update efficiency for reinforcement learning, flexible interfaces for multimodal data fusion, and a modular design that lowers the barrier to distributed training [2][4]
- The library's emphasis on extensibility reflects Tencent's recognition that large model training demands rapid iteration [4]

Group 2: Competitive Positioning
- Compared with Meta's PyTorch, WeChat-YATT stands out in reinforcement learning support; against Google's JAX, it shows advantages in Chinese-language scenarios and multimodal processing [4]
- Its deep integration with the WeChat ecosystem distinguishes it from similar reinforcement learning frameworks such as Ray RLlib [4]

Group 3: Strategic Implications
- The release of WeChat-YATT fits Tencent's broader AI strategy, which includes trademark applications for a "WeChat AI Service Platform" and the deployment of the Hunyuan model in business scenarios [7]
- Tencent aims to build a closed-loop AI ecosystem through foundational technology breakthroughs and application deployment, with WeChat-YATT as a critical component of this strategy [7]
- The focus on reinforcement learning signals Tencent's commitment to key areas such as gaming, recommendation systems, and autonomous driving, positioning the company for future AI applications [7]

Group 4: Long-term Vision
- The name WeChat-YATT, short for "Yet Another Transformer Trainer," reflects both a sense of humor and Tencent's long-term investment in AI infrastructure [6]
- Competition in the era of large models is fundamentally a competition over infrastructure, and WeChat-YATT is one piece of Tencent's broader AI blueprint [7]
Boosting Large Model Communication Performance by 30%: DeepSeek Credits Tencent's Contribution of a Network Acceleration Solution for Large Model Training
Shen Zhen Shang Bao· 2025-05-11 22:32
Core Insights
- Tencent's technical team has optimized the DeepEP communication framework, delivering significant performance improvements across network environments: a 100% gain on RoCE networks and a 30% gain on IB networks, enabling more efficient AI large model training [2][3]
- The optimization targets key bottlenecks in the original DeepEP framework, particularly bandwidth utilization and CPU control-path latency, which had limited its broader adoption [2][3]

Group 1
- Intelligent bandwidth allocation is achieved through topology-aware multi-QP chaining, which keeps both ports of a dual-port network card fully utilized and prevents bandwidth waste, as sketched in the example below [3]
- Tencent removed the CPU control bottleneck in GPU communication by taking control-plane operations off the CPU path, reducing latency and energy consumption [3]
- A new "QP internal sequencing lock" mechanism ensures that data is transmitted accurately and in order among multiple GPUs, even when handling more than 1,000 simultaneous data-transfer tasks [3]

Group 2
- The optimized DeepEP framework has been fully open-sourced and is already applied in Tencent's Hunyuan large model training and inference projects, demonstrating strong versatility in high-performance environments built on Tencent's Xingmai network and H20 servers [3]
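As a rough illustration of the two mechanisms named above (multi-QP striping and per-QP sequencing), here is a minimal Python simulation. It is a sketch under stated assumptions, not Tencent's or DeepEP's actual code: real implementations run on RDMA verbs and GPU-side kernels, and every name here (QueuePair, stripe_message, reassemble) is hypothetical.

```python
# Illustrative simulation only: models striping one message across several QPs
# so both NIC ports stay busy, and tagging each chunk with a per-QP sequence
# number so the receiver can reassemble everything in order.
from dataclasses import dataclass, field

@dataclass
class QueuePair:
    qp_id: int
    next_seq: int = 0          # per-QP "sequencing" counter
    sent: list = field(default_factory=list)

    def post_send(self, chunk: bytes):
        self.sent.append((self.next_seq, chunk))   # chunk carries its sequence number
        self.next_seq += 1

def stripe_message(message: bytes, qps: list, chunk_size: int = 4) -> int:
    """Round-robin the message's chunks across QPs (and thus NIC ports)."""
    chunks = [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)]
    for i, chunk in enumerate(chunks):
        qps[i % len(qps)].post_send(chunk)
    return len(chunks)

def reassemble(qps: list, total_chunks: int) -> bytes:
    """Receiver side: walk the QPs round by round, checking per-QP ordering."""
    out = []
    for rnd in range((total_chunks + len(qps) - 1) // len(qps)):
        for qp in qps:
            if rnd < len(qp.sent):
                seq, chunk = qp.sent[rnd]
                assert seq == rnd          # in-order delivery within each QP
                out.append(chunk)
    return b"".join(out)

if __name__ == "__main__":
    qps = [QueuePair(0), QueuePair(1), QueuePair(2), QueuePair(3)]
    msg = b"tokens routed by the MoE dispatch step"
    n = stripe_message(msg, qps)
    assert reassemble(qps, n) == msg
    print("reassembled correctly across", len(qps), "QPs")
```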
DeepSeek Thanks Tencent's Technical Team: The DeepEP Optimization Is a "Huge Speedup" Code Contribution
Xin Lang Ke Ji· 2025-05-07 11:12
Core Insights
- Tencent's technical team has optimized the DeepEP communication framework, achieving significant performance improvements across network environments: a 100% performance increase on RoCE networks and a 30% increase on IB networks, improving AI large model training efficiency [1][2]

Group 1: Technical Enhancements
- The optimization replaced IBRC with IBGDA and used distinct Queue Pairs (QPs) per channel for parallel data transmission, improving both the robustness and the communication performance of the normal kernels [1]
- In RDMA scenarios the optimized framework reached an algorithm bandwidth of 58 GB/s, with a calculated physical bandwidth of 43.5 GB/s [1]

Group 2: Industry Impact
- Since DeepSeek open-sourced DeepEP, among other projects, in February, the framework has demonstrated a 300% increase in communication efficiency, reducing the dependence of MoE-architecture large models on NVIDIA NCCL [2]
- The optimizations have been applied in Tencent's Hunyuan model projects, demonstrating strong versatility in high-performance environments built on Tencent's Xingmai network and H20 servers [2]
Technology-Driven Growth and Green Transformation Advance in Tandem: Runze Technology Posts Steady Q1 Growth
Core Insights
- The company reported revenue of 1.198 billion yuan and net profit of 430 million yuan for Q1 2025, reflecting healthy financial metrics [1]
- As a leading provider of intelligent computing infrastructure in China, the company is leveraging technological innovation and green development to build a future-oriented computing foundation [1]
- The company has established seven AIDC intelligent computing clusters across key economic regions; all delivered and upcoming computing centers have secured production orders and are expected to be operational by 2025 [1]

Technological Developments
- The company is deepening the commercialization of liquid cooling technology, having delivered the industry's first fully liquid-cooled green computing center in 2023 [1]
- The Power Usage Effectiveness (PUE) of its liquid-cooled computing centers has been reduced to approximately 1.15, a significant energy-efficiency gain, as illustrated in the worked example after this summary [1]
- The company is carrying out energy-saving retrofits of existing computing centers and has reached industry-leading PUE levels at its Langfang campus, supporting AI model training with reliable and efficient computing infrastructure [1]

Green Development Strategy
- The company is actively pursuing a "low-carbon, green" roadmap for its computing centers; its A-7 and A-18 centers have been recognized as national green data centers for their energy-saving performance [2]
- In 2024 the company completed 800 million kilowatt-hours of green electricity transactions, underscoring its commitment to energy-saving technology research and green transformation [2]

Strategic Expansion
- The company's strategic layout in the Hainan Free Trade Port aligns with national policy, as the State Council has approved cross-border e-commerce comprehensive pilot zones in Hainan and other cities [3]
- The company is building an intelligent computing infrastructure cluster in Danzhou, Hainan, with a planned capacity of approximately 30,000 cabinets, aimed at supporting cross-border operations [3]
- This initiative supports the digital economy goals of the Hainan Free Trade Port construction plan and lays the groundwork for expansion into overseas markets [3]
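As a quick worked example of what the reported PUE figure implies, the arithmetic below assumes a 10 MW IT load and a 1.5 air-cooled baseline purely for illustration; neither number comes from the company's report.

```python
# Minimal PUE arithmetic sketch (illustrative assumptions only).
# PUE = total facility power / IT equipment power, so PUE = 1.15 means
# cooling, power conversion and other overhead add about 15% on top of IT load.
def facility_power(it_power_mw: float, pue: float) -> float:
    """Total facility power implied by a given IT load and PUE."""
    return it_power_mw * pue

it_load_mw = 10.0                      # assumed IT load, for illustration
for pue in (1.5, 1.15):                # assumed air-cooled baseline vs. the reported figure
    total = facility_power(it_load_mw, pue)
    overhead = total - it_load_mw
    print(f"PUE {pue}: total {total:.1f} MW, non-IT overhead {overhead:.1f} MW")
# Under these assumptions, moving from PUE 1.5 to 1.15 cuts non-IT overhead
# from 5.0 MW to 1.5 MW -- roughly a 70% reduction for the same IT load.
```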