AMD GPUs Go Brrr
created: Nov. 15, 2025, 2:06 a.m. | updated: Nov. 15, 2025, 2:39 p.m.
Unsurprisingly, making AMD GPUs go brr boils down to keeping the “matrix cores” (tensor cores on NVIDIA) fed.
AMD offers small, fine-grained matrix core instructions, while NVIDIA tensor core instructions are generally called with large input operands.
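To give a rough sense of what instruction granularity means in practice, the sketch below counts how many matrix instructions are needed to cover one output tile. The tile size and both instruction shapes are illustrative assumptions, not figures taken from the text, and `instructions_per_tile` is a hypothetical helper rather than part of any vendor API.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def instructions_per_tile(tile, instr):
    """Count matrix instructions needed to cover an (M, N, K) tile,
    given the (M, N, K) shape handled by a single instruction."""
    m, n, k = tile
    im, i_n, ik = instr
    return ceil_div(m, im) * ceil_div(n, i_n) * ceil_div(k, ik)


# All shapes below are assumptions chosen for illustration only.
TILE = (256, 256, 64)          # per-block tile (M, N, K)
SMALL_INSTR = (16, 16, 32)     # a fine-grained instruction shape
LARGE_INSTR = (64, 256, 16)    # a large-operand instruction shape

small_count = instructions_per_tile(TILE, SMALL_INSTR)
large_count = instructions_per_tile(TILE, LARGE_INSTR)

print(f"fine-grained instructions per tile: {small_count}")
print(f"large-operand instructions per tile: {large_count}")
```

Under these assumed shapes, the fine-grained case issues many more instructions for the same tile, which is one reason instruction scheduling matters more when the instructions are small.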
AMD has TMA-like direct global-to-shared-memory loads via buffer_load_dword instructions, which bypass the register file.
Figure: AMD register layouts for matrix instructions are less structured.
We found two scheduling patterns that consistently yield high occupancy on AMD GPUs while using tile programming primitives (no raw assembly)!
3 days, 3 hours ago: Hacker News