I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which support 512-bit operations.
Assuming there's a matrix held in 16 SIMD registers, each register holding one row of 16 32-bit integers, how can I transpose the matrix using purely SIMD instructions?
There are already solutions for transposing 4x4 and 8x8 matrices with SSE and AVX2 respectively, but I couldn't figure out how to extend them to 16x16 with AVX-512.
Any ideas?
To transpose a matrix you interchange its rows and columns: the elements of each row become the elements of the corresponding column and vice versa, so an array of shape (X, Y) becomes an array of shape (Y, X).
For two-operand SIMD instructions you can show that the number of operations necessary to transpose an nxn matrix is n*log_2(n), whereas using scalar operations it's O(n^2). In fact, below I'll show that the number of read and write operations using the scalar registers is 2*n*(n-1). Here is a table showing the number of operations to transpose 4x4, 8x8, 16x16, and 32x32 matrices using SSE, AVX, AVX512, and AVX1024 compared to scalar operations:

n                4(SSE)   8(AVX)   16(AVX512)   32(AVX1024)
SIMD ops              8       24           64           160
SIMD +r/w ops        16       40           96           224
Scalar r/w ops       24      112          480          1984

where "SIMD +r/w ops" includes the read and write operations (n*log_2(n) + 2*n).
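As a quick sanity check of these formulas (not part of the transpose itself), here is a small C snippet I wrote that prints the three counts for n = 4, 8, 16, 32 and reproduces the table above:

#include <stdio.h>

int main(void) {
    for (int n = 4; n <= 32; n *= 2) {
        int lg = 0;
        for (int m = n; m > 1; m >>= 1) lg++;   // lg = log_2(n)
        printf("n=%2d  SIMD ops=%3d  SIMD+r/w ops=%3d  scalar r/w ops=%4d\n",
               n, n*lg, n*lg + 2*n, 2*n*(n-1));
    }
    return 0;
}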
The reason the SIMD transpose can be done in n*log_2(n) operations is that the algorithm consists of log_2(n) stages with n permutes each:

permute n rows at 32-bit granularity
permute n rows at 64-bit granularity
...
permute n rows at simd_width/2-bit granularity
For example, for 4x4 there are 4 rows, so you have to permute 32-bit lanes 4 times and then 64-bit lanes 4 times (see the 4x4 sketch below). For 16x16 you have to permute 32-bit lanes, 64-bit lanes, 128-bit lanes, and finally 256-bit lanes, 16 times each.
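To make the stage structure concrete, here is the 4x4 single-precision case written as two stages of four permutes each, i.e. 4*log_2(4) = 8 operations. This is roughly the unpack-based variant of _MM_TRANSPOSE4_PS and is shown only to illustrate the operation count (the temporaries u0..u3 are my own names; row0..row3 are assumed to hold the four rows):

__m128 u0, u1, u2, u3;
u0 = _mm_unpacklo_ps(row0, row1);   // 0 4 1 5    (32-bit stage)
u1 = _mm_unpackhi_ps(row0, row1);   // 2 6 3 7
u2 = _mm_unpacklo_ps(row2, row3);   // 8 12 9 13
u3 = _mm_unpackhi_ps(row2, row3);   // 10 14 11 15
row0 = _mm_movelh_ps(u0, u2);       // 0 4 8 12   (64-bit stage)
row1 = _mm_movehl_ps(u2, u0);       // 1 5 9 13
row2 = _mm_movelh_ps(u1, u3);       // 2 6 10 14
row3 = _mm_movehl_ps(u3, u1);       // 3 7 11 15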
I already showed that 8x8 can be done with 24 operations with AVX. So the question is: how can this be done for 16x16 using AVX512 in 64 operations? The general algorithm is:

interleave 32-bit lanes using
    8x _mm512_unpacklo_epi32
    8x _mm512_unpackhi_epi32
interleave 64-bit lanes using
    8x _mm512_unpacklo_epi64
    8x _mm512_unpackhi_epi64
permute 128-bit lanes using
    16x _mm512_shuffle_i32x4
permute 256-bit lanes again using
    16x _mm512_shuffle_i32x4
Here is untested code that does this:
// given __m512i r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, ra, rb, rc, rd, re, rf;
__m512i t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, ta, tb, tc, td, te, tf;
t0 = _mm512_unpacklo_epi32(r0,r1); // 0 16 1 17 4 20 5 21 8 24 9 25 12 28 13 29
t1 = _mm512_unpackhi_epi32(r0,r1); // 2 18 3 19 6 22 7 23 10 26 11 27 14 30 15 31
t2 = _mm512_unpacklo_epi32(r2,r3); // 32 48 33 49 ...
t3 = _mm512_unpackhi_epi32(r2,r3); // 34 50 35 51 ...
t4 = _mm512_unpacklo_epi32(r4,r5); // 64 80 65 81 ...
t5 = _mm512_unpackhi_epi32(r4,r5); // 66 82 67 83 ...
t6 = _mm512_unpacklo_epi32(r6,r7); // 96 112 97 113 ...
t7 = _mm512_unpackhi_epi32(r6,r7); // 98 114 99 115 ...
t8 = _mm512_unpacklo_epi32(r8,r9); // 128 ...
t9 = _mm512_unpackhi_epi32(r8,r9); // 130 ...
ta = _mm512_unpacklo_epi32(ra,rb); // 160 ...
tb = _mm512_unpackhi_epi32(ra,rb); // 162 ...
tc = _mm512_unpacklo_epi32(rc,rd); // 192 ...
td = _mm512_unpackhi_epi32(rc,rd); // 194 ...
te = _mm512_unpacklo_epi32(re,rf); // 224 ...
tf = _mm512_unpackhi_epi32(re,rf); // 226 ...
r0 = _mm512_unpacklo_epi64(t0,t2); // 0 16 32 48 ...
r1 = _mm512_unpackhi_epi64(t0,t2); // 1 17 33 49 ...
r2 = _mm512_unpacklo_epi64(t1,t3); // 2 18 34 50 ...
r3 = _mm512_unpackhi_epi64(t1,t3); // 3 19 35 51 ...
r4 = _mm512_unpacklo_epi64(t4,t6); // 64 80 96 112 ...
r5 = _mm512_unpackhi_epi64(t4,t6); // 65 81 97 113 ...
r6 = _mm512_unpacklo_epi64(t5,t7); // 66 82 98 114 ...
r7 = _mm512_unpackhi_epi64(t5,t7); // 67 83 99 115 ...
r8 = _mm512_unpacklo_epi64(t8,ta); // 128 144 160 176 ...
r9 = _mm512_unpackhi_epi64(t8,ta); // 129 145 161 177 ...
ra = _mm512_unpacklo_epi64(t9,tb); // 130 146 162 178 ...
rb = _mm512_unpackhi_epi64(t9,tb); // 131 147 163 179 ...
rc = _mm512_unpacklo_epi64(tc,te); // 192 208 224 240 ...
rd = _mm512_unpackhi_epi64(tc,te); // 193 209 225 241 ...
re = _mm512_unpacklo_epi64(td,tf); // 194 210 226 242 ...
rf = _mm512_unpackhi_epi64(td,tf); // 195 211 227 243 ...
t0 = _mm512_shuffle_i32x4(r0, r4, 0x88); // 0 16 32 48 8 24 40 56 64 80 96 112 ...
t1 = _mm512_shuffle_i32x4(r1, r5, 0x88); // 1 17 33 49 ...
t2 = _mm512_shuffle_i32x4(r2, r6, 0x88); // 2 18 34 50 ...
t3 = _mm512_shuffle_i32x4(r3, r7, 0x88); // 3 19 35 51 ...
t4 = _mm512_shuffle_i32x4(r0, r4, 0xdd); // 4 20 36 52 ...
t5 = _mm512_shuffle_i32x4(r1, r5, 0xdd); // 5 21 37 53 ...
t6 = _mm512_shuffle_i32x4(r2, r6, 0xdd); // 6 22 38 54 ...
t7 = _mm512_shuffle_i32x4(r3, r7, 0xdd); // 7 23 39 55 ...
t8 = _mm512_shuffle_i32x4(r8, rc, 0x88); // 128 144 160 176 ...
t9 = _mm512_shuffle_i32x4(r9, rd, 0x88); // 129 145 161 177 ...
ta = _mm512_shuffle_i32x4(ra, re, 0x88); // 130 146 162 178 ...
tb = _mm512_shuffle_i32x4(rb, rf, 0x88); // 131 147 163 179 ...
tc = _mm512_shuffle_i32x4(r8, rc, 0xdd); // 132 148 164 180 ...
td = _mm512_shuffle_i32x4(r9, rd, 0xdd); // 133 149 165 181 ...
te = _mm512_shuffle_i32x4(ra, re, 0xdd); // 134 150 166 182 ...
tf = _mm512_shuffle_i32x4(rb, rf, 0xdd); // 135 151 167 183 ...
r0 = _mm512_shuffle_i32x4(t0, t8, 0x88); // 0 16 32 48 64 80 96 112 ... 240
r1 = _mm512_shuffle_i32x4(t1, t9, 0x88); // 1 17 33 49 65 81 97 113 ... 241
r2 = _mm512_shuffle_i32x4(t2, ta, 0x88); // 2 18 34 50 66 82 98 114 ... 242
r3 = _mm512_shuffle_i32x4(t3, tb, 0x88); // 3 19 35 51 67 83 99 115 ... 243
r4 = _mm512_shuffle_i32x4(t4, tc, 0x88); // 4 ...
r5 = _mm512_shuffle_i32x4(t5, td, 0x88); // 5 ...
r6 = _mm512_shuffle_i32x4(t6, te, 0x88); // 6 ...
r7 = _mm512_shuffle_i32x4(t7, tf, 0x88); // 7 ...
r8 = _mm512_shuffle_i32x4(t0, t8, 0xdd); // 8 ...
r9 = _mm512_shuffle_i32x4(t1, t9, 0xdd); // 9 ...
ra = _mm512_shuffle_i32x4(t2, ta, 0xdd); // 10 ...
rb = _mm512_shuffle_i32x4(t3, tb, 0xdd); // 11 ...
rc = _mm512_shuffle_i32x4(t4, tc, 0xdd); // 12 ...
rd = _mm512_shuffle_i32x4(t5, td, 0xdd); // 13 ...
re = _mm512_shuffle_i32x4(t6, te, 0xdd); // 14 ...
rf = _mm512_shuffle_i32x4(t7, tf, 0xdd); // 15 31 47 63 79 95 111 127 ... 255
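For completeness, here is one way the register sequence above might be wrapped as a memory-to-memory transpose. This is only a sketch (untested, like the code above): the function name tran_16x16 and the row-major int32_t layout are my assumptions, and the body elides the four unpack/shuffle stages already shown.

#include <immintrin.h>
#include <stdint.h>

void tran_16x16(const int32_t *mat, int32_t *matT) {
    // load one 16-int row per zmm register (rows are 16 ints apart in row-major order)
    __m512i r0 = _mm512_loadu_si512(mat +   0);
    __m512i r1 = _mm512_loadu_si512(mat +  16);
    // ... r2 through re loaded the same way ...
    __m512i rf = _mm512_loadu_si512(mat + 240);

    // the four stages shown above (epi32 unpacks, epi64 unpacks, and two
    // rounds of _mm512_shuffle_i32x4) go here, leaving the transposed
    // rows back in r0..rf

    _mm512_storeu_si512(matT +   0, r0);
    _mm512_storeu_si512(matT +  16, r1);
    // ... r2 through re stored the same way ...
    _mm512_storeu_si512(matT + 240, rf);
}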
I got the idea for using _mm512_shuffle_i32x4 by looking at transposing a 4x4 matrix using _mm_shuffle_ps (which is what MSVC uses in _MM_TRANSPOSE4_PS, but GCC and ICC don't).
__m128 tmp0, tmp1, tmp2, tmp3;
tmp0 = _mm_shuffle_ps(row0, row1, 0x88); // 0 2 4 6
tmp1 = _mm_shuffle_ps(row0, row1, 0xdd); // 1 3 5 7
tmp2 = _mm_shuffle_ps(row2, row3, 0x88); // 8 a c e
tmp3 = _mm_shuffle_ps(row2, row3, 0xdd); // 9 b d f
row0 = _mm_shuffle_ps(tmp0, tmp2, 0x88); // 0 4 8 c
row1 = _mm_shuffle_ps(tmp1, tmp3, 0x88); // 1 5 9 d
row2 = _mm_shuffle_ps(tmp0, tmp2, 0xdd); // 2 6 a e
row3 = _mm_shuffle_ps(tmp1, tmp3, 0xdd); // 3 7 b f
The same idea applies to _mm512_shuffle_i32x4, but now the lanes are 128-bit instead of 32-bit and there are 16 rows instead of 4.
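As a small, self-contained way to try that 4x4 version, here is a usage sketch; the demo data, the loads/stores, and the printing are my additions, not part of the original code:

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    float m[16] = { 0, 1, 2, 3,  4, 5, 6, 7,  8, 9, 10, 11,  12, 13, 14, 15 };
    __m128 row0 = _mm_loadu_ps(m +  0), row1 = _mm_loadu_ps(m +  4);
    __m128 row2 = _mm_loadu_ps(m +  8), row3 = _mm_loadu_ps(m + 12);

    __m128 tmp0 = _mm_shuffle_ps(row0, row1, 0x88); // 0 2 4 6
    __m128 tmp1 = _mm_shuffle_ps(row0, row1, 0xdd); // 1 3 5 7
    __m128 tmp2 = _mm_shuffle_ps(row2, row3, 0x88); // 8 10 12 14
    __m128 tmp3 = _mm_shuffle_ps(row2, row3, 0xdd); // 9 11 13 15
    row0 = _mm_shuffle_ps(tmp0, tmp2, 0x88);        // 0 4 8 12
    row1 = _mm_shuffle_ps(tmp1, tmp3, 0x88);        // 1 5 9 13
    row2 = _mm_shuffle_ps(tmp0, tmp2, 0xdd);        // 2 6 10 14
    row3 = _mm_shuffle_ps(tmp1, tmp3, 0xdd);        // 3 7 11 15

    float t[16];
    _mm_storeu_ps(t +  0, row0); _mm_storeu_ps(t +  4, row1);
    _mm_storeu_ps(t +  8, row2); _mm_storeu_ps(t + 12, row3);
    for (int i = 0; i < 16; i++) printf("%g%c", t[i], (i % 4 == 3) ? '\n' : ' ');
    return 0;
}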
Finally, to compare to scalar operations, I modified Example 9.5a from Agner Fog's Optimizing C++ manual:
#define SIZE 16

void transpose(int a[SIZE][SIZE]) { // function to transpose matrix
    // define a macro to swap two array elements:
    #define swapd(x,y) {temp=x; x=y; y=temp;}
    int r, c; int temp;
    for (r = 1; r < SIZE; r++) {
        for (c = 0; c < r; c++) {
            swapd(a[r][c], a[c][r]);
        }
    }
}
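A minimal usage sketch for this scalar version, assuming it is compiled together with the transpose function above (the verification harness is my addition):

#include <stdio.h>

int main(void) {
    int a[SIZE][SIZE], b[SIZE][SIZE];
    for (int r = 0; r < SIZE; r++)
        for (int c = 0; c < SIZE; c++)
            a[r][c] = b[r][c] = 16*r + c;   // element index, as in the comments above
    transpose(a);
    int ok = 1;
    for (int r = 0; r < SIZE; r++)
        for (int c = 0; c < SIZE; c++)
            ok &= (a[r][c] == b[c][r]);     // transposed element must equal the mirrored original
    printf("transpose %s\n", ok ? "ok" : "wrong");
    return 0;
}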
This does n*(n-1)/2 swaps (the diagonal does not need to be swapped), and each swap compiles to two reads and two writes. The swaps in the assembly for 16x16 look like

mov r8d, DWORD PTR [rax+68]
mov r9d, DWORD PTR [rdx+68]
mov DWORD PTR [rax+68], r9d
mov DWORD PTR [rdx+68], r8d

so the number of read/write operations using the scalar registers is 4 * n*(n-1)/2 = 2*n*(n-1).