Highly efficient matrix transpose in Mojo

created: June 6, 2025, 7:28 p.m. | updated: June 7, 2025, 7:51 p.m.

06 Jun, 2025

In this blog post I will show you, step by step, how to implement a highly efficient transpose kernel for the Hopper architecture using Mojo.

    [...] grid_constant`)
    fn transpose_kernel_naive[block_size: Int](
        descriptor: TMADescriptor, descriptor_tr: TMADescriptor
    ):
        var shmem = stack_allocation[block_size * block_size, DType. [...]
        [...] x % block_size
        row = thread_idx. [...]

Compute transpose in shared memory

    # COMPUTE
    shmem_tr[col * block_size + row] = shmem[row * block_size + col]

    # FENCE
    barrier()
    tma_store_fence()

We compute the transpose using our two shared-memory arrays, shmem and shmem_tr.

    [...] SWIZZLE_128B](
        gmem_tr_dev,
        (GMEM_WIDTH, GMEM_HEIGHT),
        (GMEM_HEIGHT, 1),
        (SMEM_WIDTH, SMEM_HEIGHT),
    )

We can compute swizzled indices like this:

    fn calculate_row_swizzle[block_size: Int](col: Int, row: Int) -> Int:
        # Index of the 16-byte chunk holding element (col, row) of the
        # transposed tile; the factor 4 is the element size in bytes.
        i16_tr = (col * block_size + row) * 4 >> 4
        y16_tr = i16_tr >> 3  # 128-byte line that contains the chunk
        x16_tr = i16_tr & 7   # chunk position inside that line
        x16_swz_tr = y16_tr ^ x16_tr
        return ((x16_swz_tr * 4) & (block_size - 1)) + (row & 3)

    fn calculate_col_swizzle[block_size: Int](col: Int, row: Int) -> Int:
        i16 = (row * block_size + col) * 4 >> 4
        y16 = i16 >> 3
        x16 = i16 & 7
        x16_swz = y16 ^ x16
        return ((x16_swz * 4) & (block_size - 1)) + (col & 3)

and then use the swizzled indices inside our kernel like so:

    col_swizzle = calculate_col_swizzle[block_size](col, row)
    row_swizzle = calculate_row_swizzle[block_size](col, row)
    ...
    # COMPUTE
    shmem_tr[col * block_size + row_swizzle] = shmem[row * block_size + col_swizzle]

Everything else is exactly the same.
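To convince yourself that the swizzled indexing still produces a correct transpose, the index arithmetic can be replayed on the CPU. The sketch below is a plain Python port of the two swizzle functions, not part of the kernel. It assumes a 32 x 32 tile of 4-byte elements, and it assumes that a swizzled TMA load puts logical element (row, col) at `row * block_size + col_swizzle` and that the swizzled store reads the transposed tile back the same way. The helpers `is_row_permutation` and `transpose_via_swizzle` are hypothetical names introduced only for this check.

```python
def calculate_col_swizzle(block_size: int, col: int, row: int) -> int:
    """Swizzled column of logical element (row, col), assuming 4-byte elements."""
    i16 = ((row * block_size + col) * 4) >> 4  # index of the 16-byte chunk
    y16 = i16 >> 3  # 128-byte line that contains the chunk
    x16 = i16 & 7   # chunk position inside that line
    x16_swz = y16 ^ x16
    return ((x16_swz * 4) & (block_size - 1)) + (col & 3)


def calculate_row_swizzle(block_size: int, col: int, row: int) -> int:
    """Same computation for the transposed tile, where (col, row) swap roles."""
    i16_tr = ((col * block_size + row) * 4) >> 4
    y16_tr = i16_tr >> 3
    x16_tr = i16_tr & 7
    x16_swz_tr = y16_tr ^ x16_tr
    return ((x16_swz_tr * 4) & (block_size - 1)) + (row & 3)


def is_row_permutation(block_size: int) -> bool:
    """True if, in every row, the swizzled columns hit each slot exactly once."""
    expected = list(range(block_size))
    return all(
        sorted(calculate_col_swizzle(block_size, c, r) for c in range(block_size))
        == expected
        for r in range(block_size)
    )


def transpose_via_swizzle(tile, block_size):
    """Emulate swizzled load, the COMPUTE step, and swizzled store on the CPU."""
    n = block_size
    shmem = [None] * (n * n)
    shmem_tr = [None] * (n * n)

    # Assumed effect of the swizzled load: (row, col) lands at the swizzled column.
    for row in range(n):
        for col in range(n):
            shmem[row * n + calculate_col_swizzle(n, col, row)] = tile[row][col]

    # COMPUTE step, one (row, col) pair per emulated thread.
    for row in range(n):
        for col in range(n):
            col_swizzle = calculate_col_swizzle(n, col, row)
            row_swizzle = calculate_row_swizzle(n, col, row)
            shmem_tr[col * n + row_swizzle] = shmem[row * n + col_swizzle]

    # Assumed effect of the swizzled store: undo the swizzle of the transposed tile.
    out = [[None] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            out[col][row] = shmem_tr[col * n + calculate_row_swizzle(n, col, row)]
    return out
```

Calling `transpose_via_swizzle(tile, 32)` on a tile filled with distinct values should return the ordinary transpose, and `is_row_permutation(32)` confirms that no two columns of a row collide after swizzling. Other tile sizes or element widths would need the constants `4`, `3` and `7` revisited.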

1 day, 6 hours ago: Hacker News