However Briefly

AMD GPUs Go Brrr

created: Nov. 15, 2025, 2:06 a.m. | updated: Nov. 15, 2025, 2:39 p.m.

Unsurprisingly, making AMD GPUs go brr boils down to keeping the “matrix cores” (tensor cores on NVIDIA) fed. AMD offers tiny, fine-grained matrix core instructions, while NVIDIA tensor core instructions are generally called with large input operands. AMD has TMA-like direct global-to-shared-memory loads via buffer_load_dword instructions, which bypass the register file. AMD register layouts for matrix instructions are less structured. We found two scheduling patterns that consistently yield high occupancy on AMD GPUs, while using tile programming primitives (no raw assembly)!
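
To get a rough feel for the fine-grained versus large-operand contrast, here is a minimal sketch that counts how many matrix instructions it takes to cover one output tile. The function name and every shape below are hypothetical, picked only for illustration; they are not taken from the article or from either vendor's instruction set.

```python
def mma_instruction_count(tile_m, tile_n, tile_k, inst_m, inst_n, inst_k):
    """Count the matrix instructions needed to cover a tile_m x tile_n x tile_k block.

    Each instruction is assumed to compute an inst_m x inst_n output block
    while consuming inst_k elements along the reduction dimension.
    """
    for tile, inst in ((tile_m, inst_m), (tile_n, inst_n), (tile_k, inst_k)):
        if tile % inst != 0:
            raise ValueError("tile dimensions must be multiples of the instruction shape")
    return (tile_m // inst_m) * (tile_n // inst_n) * (tile_k // inst_k)


# Hypothetical shapes: a small fine-grained instruction vs. a large one,
# both covering the same 256 x 256 x 64 tile.
fine = mma_instruction_count(256, 256, 64, 16, 16, 32)
coarse = mma_instruction_count(256, 256, 64, 64, 256, 16)

print(f"fine-grained: {fine} instructions, coarse: {coarse} instructions")
```

Under these assumed shapes, the fine-grained case issues far more instructions for the same tile, which is one way to see why instruction scheduling carries so much weight when the goal is to keep the matrix cores fed.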

3 weeks, 5 days ago: Hacker News