Getting Started

This post is an in-depth overview of GPU architecture and how to write performant GPU code. It covers the execution hierarchy, memory layout, scheduling, memory access patterns, and basic profiling. The goal is to build enough knowledge to write an SGEMM (single-precision general matrix multiply) kernel that achieves 50% of the GPU's theoretical FLOPS.

The specifics in this guide, including naming and the capabilities of each SM, are tailored to Nvidia's Blackwell (GB203) generation of cards, specifically the 5070 Ti.

GPU Architecture Overview

This is a high-level chart showing the hierarchy of components in an Nvidia GPU. At the top are the GPCs. The 5070 Ti includes 6 GPCs and 35 TPCs, which averages to about 6 TPCs per GPC. The distribution is intentionally uneven to accommodate performance tuning and chip-layout constraints.

GPU
├─ GPC (Graphics Processing Cluster)
│  ├─ TPC (Texture Processing Cluster)
│  │  ├─ SM (Streaming Multiprocessor)
│  │  └─ SM (Streaming Multiprocessor)
│  └─ TPC (Texture Processing Cluster)
│     ├─ SM (Streaming Multiprocessor)
│     └─ SM (Streaming Multiprocessor)
├─ L2 Cache: 48 MB, shared across all SMs
└─ Global Memory: 16 GB GDDR7, off-chip DRAM

If you want to see a more comprehensive review of GPU architecture check out High Yield’s videos on YouTube. He does a great job of showing where each element is on the physical GPU die.

The purpose of GPCs and TPCs is to organize SMs (the main compute units of the GPU) into modular blocks that have their own memory, cache, instruction dispatch, and texture units. The exact organization varies by architecture; Blackwell has a different GPC/TPC ratio than Ada Lovelace, for example. Without this abstraction, there would be excessive contention for global resources, and scaling the chip across product tiers would be much more difficult.

GPCs in traditional consumer GPUs also handle rasterization and other graphics functions. In compute-only GPUs like the Nvidia H100, much of that graphics hardware is traded away for compute throughput. For machine-learning workloads this almost never comes into the picture; we're focused entirely on the SMs.

Streaming Multiprocessors

There are a lot of individual components that make up an SM:

| Element | Notes | Count / Size per SM |
|---|---|---|
| CUDA cores | Scalar ALUs that can execute one FP32 or INT32 instruction per clock cycle, per core. | 128 |
| Tensor cores | Accelerate small matrix multiply-accumulate ops using mixed precision (FP16, BF16, TF32). | 4 |
| Special Function Units | Handle transcendental and high-latency functions: sin, cos, exp, sqrt, etc. | 4 |
| Warp schedulers | Each manages instruction dispatch for one warp (32 threads) per cycle, directing execution to available CUDA cores. | 4 |
| Load/Store units | Interface for memory ops (load, store). Route data to/from the memory hierarchy. | 8 |
| Register file | Fast, per-thread storage used for all intermediate values. Like CPU registers, but all 32-bit. | 256 KB |
| Shared memory / L1 cache | Low-latency, per-SM memory. Shared memory is carved out of the L1 cache and is managed by the programmer. | 128 KB |

Most, if not all, of the compute on a GPU is done by CUDA cores. Some mixed-precision datatypes (fp16, bf16, tf32, etc.) are offloaded to other units within the SM (tensor cores, for example), along with all exp-, sin-, and cos-adjacent computations (handled by the SFUs).

For example, a minimal CUDA kernel that adds two vectors, one element per thread:

__global__ void add(const float *a, const float *b, float *c, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (gid >= n) return;  // guard: the grid may cover more threads than elements
    c[gid] = a[gid] + b[gid];
}

