Consider 8 AVX512 registers containing rows of a matrix, so that each 64-bit lane is a cell of an 8x8 matrix. How to transpose such a matrix in C/C++?
What I have tried so far: 8 _mm512_i32scatter_epi64() intrinsics. It's extremely slow and uses the cache heavily so that other CPU cores can't get enough L3 bandwidth.
If the values are originally in RAM, try _mm512_i32gather_epi64 instead. Simultaneous reading from cache could lead to less stalling than writing to it.
If the destination is memory, dummy writing zeroes to a cache-line-aligned destination area could ensure that the instructions don't stall waiting on cache. With the scatter instructions the CPU doesn't know that whole destination area will be overwritten, so it has to wait for loading of the previous values.
If the values are already in registers and the goal is to keep them in registers, _mm512_permutex2var_pd seems like the most useful instruction. It lets to pick individual elements by index from two sources. But because values are distributed across so many registers, shuffling them into place will take a lot of operations.
I'll mark the matrix rows A-H and columns 0-7. Initially each register will contain one row, which I'll mark e.g. A0-7. Goal is to have each register contain one column, marked e.g. A-H0.
A0 .. A7 .. .. H0 .. H7
The best sequence I found is still 24 shuffles:
Do 8 shuffles on pairs of registers in blocks of 4:
A0-7 and B0-7 to A-B0-3A0-7 and B0-7 to A-B4-7G0-7 and H0-7 to G-H4-7Do 8 shuffles in blocks of 2:
A-B0-3 and C-D0-3 to A-D0-1A-B0-3 and C-D0-3 to A-D2-3E-F4-7 and G-H4-7 to E-H6-7Do 8 shuffles in blocks of 1:
A-D0-1 and E-H0-1 to A-H0A-D0-1 and E-H0-1 to A-H1A-D6-7 and E-H6-7 to A-H7I think this is the minimum possible number of shuffles when operating only in registers. Each AVX512 instruction takes at most two registers as input, and this sequence takes full advantage of the inputs and outputs to gather data from the 8 source registers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With