This post is an in-depth overview of GPU architecture and how to write performant GPU code. It covers the execution hierarchy, memory layout, scheduling, memory access patterns, and basic profiling. The goal is to build enough knowledge to write an SGEMM (single-precision general matrix multiply) kernel that achieves 50% of the GPU's theoretical FLOPS.
The specifics in this guide, including naming and the capabilities of each SM, are tailored to Nvidia's Blackwell (GB203) generation of cards (specifically the 5070 Ti).
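To make that 50% target concrete: theoretical FP32 throughput is CUDA cores × clock rate × 2, since a fused multiply-add counts as two floating-point ops per clock. A minimal sketch of the arithmetic, assuming the 5070 Ti's 8960 CUDA cores and its roughly 2.45 GHz boost clock (check your card's actual clocks with nvidia-smi):

```cpp
#include <cstdio>

int main() {
    // Assumed 5070 Ti specs: 8960 CUDA cores (70 SMs x 128 cores),
    // ~2.45 GHz boost clock.
    const double cuda_cores = 8960.0;
    const double boost_clock_hz = 2.45e9;
    const double flops_per_core_per_clock = 2.0; // one FMA = 2 FLOPs

    const double peak_tflops =
        cuda_cores * boost_clock_hz * flops_per_core_per_clock / 1e12;
    printf("Theoretical FP32 peak: %.1f TFLOPS\n", peak_tflops); // ~43.9
    printf("50%% SGEMM target:      %.1f TFLOPS\n", peak_tflops * 0.5);
    return 0;
}
```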
GPU Architecture Overview
This is a high-level chart showing the hierarchy of components in an Nvidia GPU. At the top is the GPC. The 5070 Ti includes 6 GPCs and 35 TPCs, which averages out to just under 6 TPCs per GPC. The distribution is intentionally uneven to accommodate performance tuning and chip-layout constraints.
If you want a more comprehensive review of GPU architecture, check out High Yield's videos on YouTube. He does a great job of showing where each element sits on the physical GPU die.
The purpose of the GPCs and TPCs is to organize SMs (the GPU's main compute units) into modular blocks that have their own memory, cache, instruction dispatch, and texture units. The exact organization varies by architecture: Blackwell has a different GPC/TPC ratio than Ada Lovelace, for example. Without this abstraction, there would be excessive contention for global resources, and scaling the chip across product tiers would be much more difficult.
GPCs in traditional consumer GPUs also handle rasterization and other graphics functions. In compute-only GPUs like the Nvidia H100, they may be optimized purely for throughput. For machine-learning-oriented workloads, this almost never comes into the picture; we're focused entirely on the SMs.
Streaming Multiprocessors
There are a lot of individual components that make up an SM:
| Element | Notes | Count / Size Per SM |
|---|---|---|
| CUDA cores | Scalar ALUs that can execute one FP32 or INT32 instruction per clock cycle, per core. | 128 |
| Tensor cores | Accelerates small matrix multiply-accumulate ops using mixed precision (FP16, BF16, TF32). | 4 |
| Special Function Units | Handles transcendental and high-latency functions: sin, cos, exp, sqrt, etc. | 4 |
| Warp schedulers | Manages instruction dispatch for one warp (32 threads) per cycle, directing execution to available CUDA cores. | 4 |
| Load/Store units | Interface for memory ops (load, store). Routes data to/from memory hierarchy. | 8 |
| Register file | Fast, per-thread memory used for all intermediate values. Like CPU registers, but all 32-bit. | 256 KB |
| Shared memory/L1 cache | Low-latency, per-SM memory. Shared memory is a programmer-managed partition of the same physical storage as the L1 cache. | 128 KB |
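Several of these figures can be read at runtime through the CUDA runtime's device-properties API. A minimal sketch (the fields below are real members of cudaDeviceProp; on a GB203 card the output should line up with the table):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query device 0

    printf("SMs:                  %d\n", prop.multiProcessorCount);
    printf("Warp size:            %d threads\n", prop.warpSize);
    printf("Registers per SM:     %d (= %zu KB, 4 bytes each)\n",
           prop.regsPerMultiprocessor,
           (size_t)prop.regsPerMultiprocessor * 4 / 1024);
    printf("Shared memory per SM: %zu KB\n",
           prop.sharedMemPerMultiprocessor / 1024);
    return 0;
}
```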
Most, if not all, of the compute on a GPU is done by CUDA cores. Some mixed-precision datatypes (FP16, BF16, TF32, etc.) are offloaded to other units within the SM (tensor cores, for example), along with all exp-, sin-, and cos-adjacent computations (handled by the SFUs), as the sketch below illustrates.
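To see the SFU path in action: CUDA's fast-math intrinsics (__expf, __sinf, and friends, all part of the standard CUDA math API) compile to SFU instructions, trading a little accuracy for much lower latency than the software expf. A minimal sketch:

```cpp
__global__ void fast_exp(const float *in, float *out, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;
    // __expf lowers to an SFU instruction (a MUFU op in SASS) rather
    // than the slower, more accurate software expf() expansion.
    out[gid] = __expf(in[gid]);
}
```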
Back on the CUDA-core side, here's a minimal kernel that adds two 1000-element vectors, one element per thread:

```cpp
__global__ void add(const float *a, const float *b, float *c) {
    // Global thread index: which element this thread handles.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard against threads past the end of the 1000-element arrays.
    if (gid >= 1000) return;
    c[gid] = a[gid] + b[gid];
}
```
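For completeness, a hedged sketch of the host-side launch (the buffer names and 256-thread block size are illustrative choices, not from the post):

```cpp
#include <cuda_runtime.h>

int main() {
    const int n = 1000;
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));
    // ... fill d_a and d_b with input data via cudaMemcpy ...

    // 256 threads per block; enough blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(d_a, d_b, d_c);
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Launching with a rounded-up block count is why the kernel needs its bounds check: the last block may contain threads with no element to process.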