I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which support 512-bit operations.
Assuming there's a matrix held in 16 SIMD registers, each register holding one row of 16 32-bit integers, how can I transpose the matrix using purely SIMD instructions?
There are already solutions for transposing 4x4 and 8x8 matrices with SSE and AVX2 respectively, but I couldn't figure out how to extend them to 16x16 with AVX-512.
Any ideas?
To transpose a matrix you interchange its rows and columns: the elements of each row become the elements of the corresponding column and vice versa, so an array of shape (X, Y) becomes an array of shape (Y, X).
For two-operand SIMD instructions you can show that the number of operations necessary to transpose an nxn matrix is n*log_2(n), whereas using scalar operations it's O(n^2). In fact, below I'll show that the number of read and write operations using the scalar registers is 2*n*(n-1). Here is a table showing the number of operations to transpose 4x4, 8x8, 16x16, and 32x32 matrices using SSE, AVX, AVX512, and AVX1024 compared to scalar operations:

n                4(SSE)   8(AVX)   16(AVX512)   32(AVX1024)
SIMD ops              8       24           64           160
SIMD +r/w ops        16       40           96           224
Scalar r/w ops       24      112          480          1984

where "SIMD +r/w ops" includes the read and write operations (n*log_2(n) + 2*n).
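As a quick sanity check of these formulas (not part of the transpose itself), here is a small C snippet I wrote that prints the three counts for n = 4, 8, 16, 32 and reproduces the table above:

#include <stdio.h>

int main(void) {
    for (int n = 4; n <= 32; n *= 2) {
        int lg = 0;
        for (int m = n; m > 1; m >>= 1) lg++;   // lg = log_2(n)
        printf("n=%2d  SIMD ops=%3d  SIMD+r/w ops=%3d  scalar r/w ops=%4d\n",
               n, n*lg, n*lg + 2*n, 2*n*(n-1));
    }
    return 0;
}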
The reason the SIMD transpose can be done in n*log_2(n) operations is that the algorithm consists of log_2(n) stages with n permutes each:

permute n rows at 32-bit granularity
permute n rows at 64-bit granularity
...
permute n rows at simd_width/2-bit granularity
For example, for 4x4 there are 4 rows, so you have to permute 32-bit lanes 4 times and then 64-bit lanes 4 times (see the 4x4 sketch below). For 16x16 you have to permute 32-bit lanes, 64-bit lanes, 128-bit lanes, and finally 256-bit lanes, 16 times each.
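To make the stage structure concrete, here is the 4x4 single-precision case written as two stages of four permutes each, i.e. 4*log_2(4) = 8 operations. This is roughly the unpack-based variant of _MM_TRANSPOSE4_PS and is shown only to illustrate the operation count (the temporaries u0..u3 are my own names; row0..row3 are assumed to hold the four rows):

__m128 u0, u1, u2, u3;
u0 = _mm_unpacklo_ps(row0, row1);   // 0 4 1 5    (32-bit stage)
u1 = _mm_unpackhi_ps(row0, row1);   // 2 6 3 7
u2 = _mm_unpacklo_ps(row2, row3);   // 8 12 9 13
u3 = _mm_unpackhi_ps(row2, row3);   // 10 14 11 15
row0 = _mm_movelh_ps(u0, u2);       // 0 4 8 12   (64-bit stage)
row1 = _mm_movehl_ps(u2, u0);       // 1 5 9 13
row2 = _mm_movelh_ps(u1, u3);       // 2 6 10 14
row3 = _mm_movehl_ps(u3, u1);       // 3 7 11 15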
I already showed that 8x8 can be done with 24 operations with AVX. So the question is: how can this be done for 16x16 using AVX512 in 64 operations? The general algorithm is:

interleave 32-bit lanes using
    8x _mm512_unpacklo_epi32
    8x _mm512_unpackhi_epi32
interleave 64-bit lanes using
    8x _mm512_unpacklo_epi64
    8x _mm512_unpackhi_epi64
permute 128-bit lanes using
    16x _mm512_shuffle_i32x4
permute 256-bit lanes again using
    16x _mm512_shuffle_i32x4
Here is untested code that does this:
// given __m512i r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, ra, rb, rc, rd, re, rf;
__m512i t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, ta, tb, tc, td, te, tf;
t0 = _mm512_unpacklo_epi32(r0,r1); // 0 16 1 17 4 20 5 21 8 24 9 25 12 28 13 29
t1 = _mm512_unpackhi_epi32(r0,r1); // 2 18 3 19 6 22 7 23 10 26 11 27 14 30 15 31
t2 = _mm512_unpacklo_epi32(r2,r3); // 32 48 33 49 ...
t3 = _mm512_unpackhi_epi32(r2,r3); // 34 50 35 51 ...
t4 = _mm512_unpacklo_epi32(r4,r5); // 64 80 65 81 ...
t5 = _mm512_unpackhi_epi32(r4,r5); // 66 82 67 83 ...
t6 = _mm512_unpacklo_epi32(r6,r7); // 96 112 97 113 ...
t7 = _mm512_unpackhi_epi32(r6,r7); // 98 114 99 115 ...
t8 = _mm512_unpacklo_epi32(r8,r9); // 128 ...
t9 = _mm512_unpackhi_epi32(r8,r9); // 130 ...
ta = _mm512_unpacklo_epi32(ra,rb); // 160 ...
tb = _mm512_unpackhi_epi32(ra,rb); // 162 ...
tc = _mm512_unpacklo_epi32(rc,rd); // 192 ...
td = _mm512_unpackhi_epi32(rc,rd); // 194 ...
te = _mm512_unpacklo_epi32(re,rf); // 224 ...
tf = _mm512_unpackhi_epi32(re,rf); // 226 ...
r0 = _mm512_unpacklo_epi64(t0,t2); // 0 16 32 48 ...
r1 = _mm512_unpackhi_epi64(t0,t2); // 1 17 33 49 ...
r2 = _mm512_unpacklo_epi64(t1,t3); // 2 18 34 50 ...
r3 = _mm512_unpackhi_epi64(t1,t3); // 3 19 35 51 ...
r4 = _mm512_unpacklo_epi64(t4,t6); // 64 80 96 112 ...
r5 = _mm512_unpackhi_epi64(t4,t6); // 65 81 97 113 ...
r6 = _mm512_unpacklo_epi64(t5,t7); // 66 82 98 114 ...
r7 = _mm512_unpackhi_epi64(t5,t7); // 67 83 99 115 ...
r8 = _mm512_unpacklo_epi64(t8,ta); // 128 144 160 176 ...
r9 = _mm512_unpackhi_epi64(t8,ta); // 129 145 161 177 ...
ra = _mm512_unpacklo_epi64(t9,tb); // 130 146 162 178 ...
rb = _mm512_unpackhi_epi64(t9,tb); // 131 147 163 179 ...
rc = _mm512_unpacklo_epi64(tc,te); // 192 208 224 240 ...
rd = _mm512_unpackhi_epi64(tc,te); // 193 209 225 241 ...
re = _mm512_unpacklo_epi64(td,tf); // 194 210 226 242 ...
rf = _mm512_unpackhi_epi64(td,tf); // 195 211 227 243 ...
t0 = _mm512_shuffle_i32x4(r0, r4, 0x88); // 0 16 32 48 8 24 40 56 64 80 96 112 ...
t1 = _mm512_shuffle_i32x4(r1, r5, 0x88); // 1 17 33 49 ...
t2 = _mm512_shuffle_i32x4(r2, r6, 0x88); // 2 18 34 50 ...
t3 = _mm512_shuffle_i32x4(r3, r7, 0x88); // 3 19 35 51 ...
t4 = _mm512_shuffle_i32x4(r0, r4, 0xdd); // 4 20 36 52 ...
t5 = _mm512_shuffle_i32x4(r1, r5, 0xdd); // 5 21 37 53 ...
t6 = _mm512_shuffle_i32x4(r2, r6, 0xdd); // 6 22 38 54 ...
t7 = _mm512_shuffle_i32x4(r3, r7, 0xdd); // 7 23 39 55 ...
t8 = _mm512_shuffle_i32x4(r8, rc, 0x88); // 128 144 160 176 ...
t9 = _mm512_shuffle_i32x4(r9, rd, 0x88); // 129 145 161 177 ...
ta = _mm512_shuffle_i32x4(ra, re, 0x88); // 130 146 162 178 ...
tb = _mm512_shuffle_i32x4(rb, rf, 0x88); // 131 147 163 179 ...
tc = _mm512_shuffle_i32x4(r8, rc, 0xdd); // 132 148 164 180 ...
td = _mm512_shuffle_i32x4(r9, rd, 0xdd); // 133 149 165 181 ...
te = _mm512_shuffle_i32x4(ra, re, 0xdd); // 134 150 166 182 ...
tf = _mm512_shuffle_i32x4(rb, rf, 0xdd); // 135 151 167 183 ...
r0 = _mm512_shuffle_i32x4(t0, t8, 0x88); // 0 16 32 48 64 80 96 112 ... 240
r1 = _mm512_shuffle_i32x4(t1, t9, 0x88); // 1 17 33 49 65 81 97 113 ... 241
r2 = _mm512_shuffle_i32x4(t2, ta, 0x88); // 2 18 34 50 66 82 98 114 ... 242
r3 = _mm512_shuffle_i32x4(t3, tb, 0x88); // 3 19 35 51 67 83 99 115 ... 243
r4 = _mm512_shuffle_i32x4(t4, tc, 0x88); // 4 ...
r5 = _mm512_shuffle_i32x4(t5, td, 0x88); // 5 ...
r6 = _mm512_shuffle_i32x4(t6, te, 0x88); // 6 ...
r7 = _mm512_shuffle_i32x4(t7, tf, 0x88); // 7 ...
r8 = _mm512_shuffle_i32x4(t0, t8, 0xdd); // 8 ...
r9 = _mm512_shuffle_i32x4(t1, t9, 0xdd); // 9 ...
ra = _mm512_shuffle_i32x4(t2, ta, 0xdd); // 10 ...
rb = _mm512_shuffle_i32x4(t3, tb, 0xdd); // 11 ...
rc = _mm512_shuffle_i32x4(t4, tc, 0xdd); // 12 ...
rd = _mm512_shuffle_i32x4(t5, td, 0xdd); // 13 ...
re = _mm512_shuffle_i32x4(t6, te, 0xdd); // 14 ...
rf = _mm512_shuffle_i32x4(t7, tf, 0xdd); // 15 31 47 63 79 95 111 127 ... 255
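For completeness, here is one way the register sequence above might be wrapped as a memory-to-memory transpose. This is only a sketch (untested, like the code above): the function name tran_16x16 and the row-major int32_t layout are my assumptions, and the body elides the four unpack/shuffle stages already shown.

#include <immintrin.h>
#include <stdint.h>

void tran_16x16(const int32_t *mat, int32_t *matT) {
    // load one 16-int row per zmm register (rows are 16 ints apart in row-major order)
    __m512i r0 = _mm512_loadu_si512(mat +   0);
    __m512i r1 = _mm512_loadu_si512(mat +  16);
    // ... r2 through re loaded the same way ...
    __m512i rf = _mm512_loadu_si512(mat + 240);

    // the four stages shown above (epi32 unpacks, epi64 unpacks, and two
    // rounds of _mm512_shuffle_i32x4) go here, leaving the transposed
    // rows back in r0..rf

    _mm512_storeu_si512(matT +   0, r0);
    _mm512_storeu_si512(matT +  16, r1);
    // ... r2 through re stored the same way ...
    _mm512_storeu_si512(matT + 240, rf);
}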
I got the idea for using _mm512_shuffle_i32x4 by looking at transposing a 4x4 matrix using _mm_shuffle_ps (which is what MSVC uses in _MM_TRANSPOSE4_PS, but GCC and ICC don't).
__m128 tmp0, tmp1, tmp2, tmp3;
tmp0 = _mm_shuffle_ps(row0, row1, 0x88); // 0 2 4 6
tmp1 = _mm_shuffle_ps(row0, row1, 0xdd); // 1 3 5 7
tmp2 = _mm_shuffle_ps(row2, row3, 0x88); // 8 a c e
tmp3 = _mm_shuffle_ps(row2, row3, 0xdd); // 9 b d f
row0 = _mm_shuffle_ps(tmp0, tmp2, 0x88); // 0 4 8 c
row1 = _mm_shuffle_ps(tmp1, tmp3, 0x88); // 1 5 9 d
row2 = _mm_shuffle_ps(tmp0, tmp2, 0xdd); // 2 6 a e
row3 = _mm_shuffle_ps(tmp1, tmp3, 0xdd); // 3 7 b f
The same idea applies to _mm512_shuffle_i32x4, but now the lanes are 128-bit instead of 32-bit and there are 16 rows instead of 4.
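As a small, self-contained way to try that 4x4 version, here is a usage sketch; the demo data, the loads/stores, and the printing are my additions, not part of the original code:

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    float m[16] = { 0, 1, 2, 3,  4, 5, 6, 7,  8, 9, 10, 11,  12, 13, 14, 15 };
    __m128 row0 = _mm_loadu_ps(m +  0), row1 = _mm_loadu_ps(m +  4);
    __m128 row2 = _mm_loadu_ps(m +  8), row3 = _mm_loadu_ps(m + 12);

    __m128 tmp0 = _mm_shuffle_ps(row0, row1, 0x88); // 0 2 4 6
    __m128 tmp1 = _mm_shuffle_ps(row0, row1, 0xdd); // 1 3 5 7
    __m128 tmp2 = _mm_shuffle_ps(row2, row3, 0x88); // 8 10 12 14
    __m128 tmp3 = _mm_shuffle_ps(row2, row3, 0xdd); // 9 11 13 15
    row0 = _mm_shuffle_ps(tmp0, tmp2, 0x88);        // 0 4 8 12
    row1 = _mm_shuffle_ps(tmp1, tmp3, 0x88);        // 1 5 9 13
    row2 = _mm_shuffle_ps(tmp0, tmp2, 0xdd);        // 2 6 10 14
    row3 = _mm_shuffle_ps(tmp1, tmp3, 0xdd);        // 3 7 11 15

    float t[16];
    _mm_storeu_ps(t +  0, row0); _mm_storeu_ps(t +  4, row1);
    _mm_storeu_ps(t +  8, row2); _mm_storeu_ps(t + 12, row3);
    for (int i = 0; i < 16; i++) printf("%g%c", t[i], (i % 4 == 3) ? '\n' : ' ');
    return 0;
}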
Finally, to compare to scalar operations, I modified Example 9.5a from Agner Fog's Optimizing C++ manual:
#define SIZE 16

void transpose(int a[SIZE][SIZE]) { // function to transpose matrix
    // define a macro to swap two array elements:
    #define swapd(x,y) {temp=x; x=y; y=temp;}
    int r, c; int temp;
    for (r = 1; r < SIZE; r++) {
        for (c = 0; c < r; c++) {
            swapd(a[r][c], a[c][r]);
        }
    }
}
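A minimal usage sketch for this scalar version, assuming it is compiled together with the transpose function above (the verification harness is my addition):

#include <stdio.h>

int main(void) {
    int a[SIZE][SIZE], b[SIZE][SIZE];
    for (int r = 0; r < SIZE; r++)
        for (int c = 0; c < SIZE; c++)
            a[r][c] = b[r][c] = 16*r + c;   // element index, as in the comments above
    transpose(a);
    int ok = 1;
    for (int r = 0; r < SIZE; r++)
        for (int c = 0; c < SIZE; c++)
            ok &= (a[r][c] == b[c][r]);     // transposed element must equal the mirrored original
    printf("transpose %s\n", ok ? "ok" : "wrong");
    return 0;
}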
This does n*(n-1)/2 swaps (the diagonal does not need to be swapped), and each swap compiles to two reads and two writes. The swaps in the assembly for 16x16 look like

mov r8d, DWORD PTR [rax+68]
mov r9d, DWORD PTR [rdx+68]
mov DWORD PTR [rax+68], r9d
mov DWORD PTR [rdx+68], r8d

so the number of read/write operations using the scalar registers is 4 * n*(n-1)/2 = 2*n*(n-1).