Workflow
Tensor Cores
icon
Search documents
What every AI engineer needs to know about GPUs โ€” Charles Frye, Modal
AI Engineerยท 2025-07-20 07:00
AI Engineering & GPU Utilization - AI engineering is shifting towards tighter integration and self-hosting of language models, increasing the need to understand GPU hardware [6][7] - The industry should focus on high bandwidth, not low latency, when utilizing GPUs [8] - GPUs optimize for math bandwidth over memory bandwidth, emphasizing computational operations [9] - Low precision matrix matrix multiplications are key to fully utilizing GPU potential [10] - Tensor cores, specialized for low precision matrix matrix multiplication, are crucial for efficient GPU usage [6][37] Hardware & Performance - GPUs achieve parallelism significantly exceeding CPUs, with the Nvidia H100 SXM GPU capable of over 16,000 parallel threads at 5 cents per thread, compared to AMD Epic CPU's two threads per core at approximately 1 watt per thread [20][21] - GPUs offer faster context switching compared to CPUs, happening every clock cycle [23] - Bandwidth improvement increases at the square of latency improvement, favoring bandwidth-oriented hardware [25][26] Model Optimization - Small models can be more hardware-sympathetic, potentially matching the quality of larger models with techniques like verification and multiple generations [32][33] - Multi-token prediction and multi-sample queries can become nearly "free" due to tensor core capabilities [36] - Generating multiple samples or tokens can improve performance by leveraging matrix matrix operations [39]