Tesla: A Highly Detailed Look at the Dojo Chip

Core Insights
- Tesla has developed a Stress tool that detects and disables faulty cores on its Dojo processors. This matters because a single silent data corruption (SDC) error can ruin weeks of AI training [1][3]
- The Dojo processor is one of the largest in the world: it is built at the scale of a 300 mm wafer and houses up to 8,850 cores per wafer-scale processor, which makes defects hard to catch during manufacturing [1][5]

Technical Details
- Each Dojo Training Tile consists of 25 D1 chips arranged in a 5×5 cluster. Each D1 chip has 354 custom 64-bit RISC-V cores with 1.25 MB of SRAM, for a total of 25 × 354 = 8,850 cores per tile. The chips are linked by a mesh network interconnect providing 10 TB/s of bandwidth [5]
- A Dojo processor draws up to 18,000 amperes and consumes 15,000 watts, which complicates SDC detection [3]

Fault Detection Methodology
- Tesla initially used differential fuzz testing to identify faulty cores, then improved the method by assigning a unique payload to each core, which speeds up testing by avoiding communication overhead [7]
- With the enhanced method, cores run multiple payloads without resetting, which increases the likelihood of detecting subtle errors [7]
- The Stress tool operates independently of the core, so testing can run in the background without taking cores offline, and only faulty cores are disabled [9]

Findings and Improvements
- The Stress tool has identified numerous defective cores within the Dojo cluster, with detection times varying significantly depending on the size of the payload executed [9]
- The tool has also uncovered rare design-level defects, which were resolved through software adjustments, indicating its effectiveness in monitoring hardware health [11]

Future Plans
- Tesla plans to use data from the Stress tool to study long-term performance degradation caused by aging, and intends to extend this testing methodology to pre-production stages [13]
- The company aims to identify potential SDC issues before production, although this is difficult because aging-related defects only appear over time [13]

Industry Context
- Developing and manufacturing wafer-scale processors is complex, and only a few companies, such as Tesla and Cerebras, have achieved it [15]
- TSMC, the manufacturer of these processors, expects more companies to adopt wafer-scale designs in the coming years, pointing to a growing trend in the industry [15]
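
The differential-testing idea from the Fault Detection Methodology section can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Tesla's Stress tool: every simulated core runs the same chain of pseudo-random payloads without resetting its state between payloads, and any core whose final state disagrees with the majority is flagged. The function names (`make_payload`, `run_payload`, `run_chain`, `find_faulty_cores`), the three-operation instruction set, and the injected single-bit flip are all hypothetical. The per-core unique payloads described in the summary are not modeled here.

```python
import random
from collections import Counter

WORD_MASK = 0xFFFFFFFFFFFFFFFF  # model 64-bit registers


def make_payload(seed, length=64):
    """Build a deterministic pseudo-random list of (op, operand) pairs."""
    rng = random.Random(seed)
    ops = ("add", "xor", "mul")
    return [(rng.choice(ops), rng.getrandbits(64)) for _ in range(length)]


def run_payload(payload, state=0, fault_step=None):
    """Execute a payload on a simulated core.

    If fault_step is set, one bit of the state is flipped after that step
    to mimic a silent data corruption (hypothetical fault model).
    """
    for step, (op, operand) in enumerate(payload):
        if op == "add":
            state = (state + operand) & WORD_MASK
        elif op == "xor":
            state ^= operand
        else:  # "mul": force an odd multiplier so the step stays invertible
            state = (state * (operand | 1)) & WORD_MASK
        if step == fault_step:
            state ^= 1 << 17  # simulated silent bit flip
    return state


def run_chain(payloads, fault=None):
    """Run several payloads back to back, carrying state over (no reset).

    fault is None for a healthy core, or (payload_index, step) for a core
    that corrupts its state once at that point.
    """
    state = 0
    for index, payload in enumerate(payloads):
        hit = fault is not None and fault[0] == index
        state = run_payload(payload, state, fault[1] if hit else None)
    return state


def find_faulty_cores(num_cores, faults, num_payloads=4):
    """Return the sorted IDs of cores whose result differs from the majority."""
    payloads = [make_payload(seed) for seed in range(num_payloads)]
    results = {
        core: run_chain(payloads, faults.get(core)) for core in range(num_cores)
    }
    majority_value, _ = Counter(results.values()).most_common(1)[0]
    return sorted(core for core, value in results.items() if value != majority_value)
```

For example, `find_faulty_cores(8, {3: (1, 10), 6: (2, 0)})` returns `[3, 6]`. Every operation in this toy instruction set is invertible, so a single flipped bit always survives to the final state; on real hardware a corruption can be masked by later instructions, which is one reason running more and longer payloads raises the chance of detection.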