AMD GPUs Go Brrr

created: Nov. 15, 2025, 2:06 a.m. | updated: Nov. 15, 2025, 2:39 p.m.

Unsurprisingly, making AMD GPUs go brr boils down to keeping the “matrix cores” (tensor cores on NVIDIA) fed. AMD offers small, fine-grained matrix core instructions, while NVIDIA tensor core instructions are generally called with large input operands. AMD has TMA-like direct global-to-shared-memory loads via buffer_load_dword instructions, which bypass the register file. Figure: AMD register layouts for matrix instructions are less structured. We found two scheduling patterns that consistently yield high occupancy on AMD GPUs while using tile programming primitives (no raw assembly)!
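
To make the "fine-grained versus large operands" contrast concrete, here is a minimal arithmetic sketch. It counts how many matrix instructions are needed to cover one output tile for a given instruction shape. The tile size and both instruction shapes are illustrative assumptions chosen for this example, not figures from the post, and `instructions_per_tile` is a hypothetical helper.

```python
from math import ceil

def instructions_per_tile(tile, instr):
    """Count matrix instructions needed to cover an (M, N, K) tile
    when each instruction computes an (m, n, k) sub-problem."""
    (M, N, K), (m, n, k) = tile, instr
    return ceil(M / m) * ceil(N / n) * ceil(K / k)

# All shapes below are assumptions for illustration only.
tile = (256, 256, 64)        # hypothetical per-block tile
fine_grained = (16, 16, 32)  # small instruction shape
coarse = (64, 256, 16)       # large instruction shape

print(instructions_per_tile(tile, fine_grained))  # 512
print(instructions_per_tile(tile, coarse))        # 16
```

Under these assumed shapes the fine-grained case issues far more instructions for the same tile, which is one way to see why instruction scheduling matters so much for keeping the matrix cores fed.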

3 days, 3 hours ago: Hacker News