
byte array permute SSE optimization

I would like to translate this code using SSE intrinsics. I found the SSSE3 pshufb instruction and the corresponding __builtin_ia32_pshufb128(v128i, v128i) GCC built-in that could possibly be used here. The code permutes a vector of bytes s by an index k, swapping bytes in the array in a specific pattern.

void permutation(int k, std::vector<char> & s) 
{
  for(size_t j = 1; j < s.size(); ++j) 
  {
      std::swap(s[k % (j + 1)], s[j]); 
      k = k / (j + 1);
  }
}

I spent a good hour thinking about how to translate this code to pshufb. Is it possible to permute 16 bytes with a single pshufb, or does it require multiple instructions? A solution that permutes just 16 bytes at a time would be good enough.

EDIT: Further context of the problem: I'm iterating over all possible permutations of s. Computing multiple results for the same s ahead of time (k = 0, 1, 2, ...) is OK. However, I need to be able to reproduce the k-th permutation later, preferably as an O(1) operation.
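
To make the indexing concrete, here is a minimal standalone driver (my own illustration, not part of the original code) that reproduces each of the 3! = 6 permutations of "abc" from its index k:

#include <iostream>
#include <string>
#include <utility>
#include <vector>

void permutation(int k, std::vector<char> & s)
{
    for (size_t j = 1; j < s.size(); ++j)
    {
        std::swap(s[k % (j + 1)], s[j]);
        k = k / (j + 1);
    }
}

int main()
{
    for (int k = 0; k < 6; ++k)   // 3! = 6 distinct permutations
    {
        std::vector<char> s = {'a', 'b', 'c'};
        permutation(k, s);
        std::cout << k << ": " << std::string(s.begin(), s.end()) << "\n";
    }
}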

asked Jan 06 '17 by JATothrim


1 Answer

Single call

Notice that you can write the number k in a positional notation system with mixed radix, so that each digit of this representation defines the indices of the swapped elements for several consecutive values of j.

For example, for strings of length 12 you can write any k < 12! = 479001600 as a three-digit number with bases:

 720 = 2*3*4*5*6     (0th digit, lowest value, covers j = 1..5)
 504 = 7*8*9         (1st digit, covers j = 6..8)
1320 = 10*11*12      (2nd digit, highest value, covers j = 9..11)
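
For instance, an arbitrary index (say k = 1000000, chosen purely for illustration) splits into these digits:

k = 1000000
digit0 = 1000000 % 720 = 640      remaining k = 1000000 / 720 = 1388
digit1 =    1388 % 504 = 380      remaining k =    1388 / 504 = 2
digit2 =       2

check: (2*504 + 380)*720 + 640 = 1000000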

Now you can precompute, for each digit position and for each possible value of the digit in that position, the cumulative permutation of all your elements, and save it in a lookup table. Then you can perform several swaps with a single instruction.

Here is a sample (precomputation would be the hardest part):

//to be precomputed:
__m128i mask0[ 720];
__m128i mask1[ 504];
__m128i mask2[1320];

__m128i permutation(int k, __m128i s) {
    s = _mm_shuffle_epi8(s, mask0[k %  720]); k /=  720;  //j = [1..5]
    s = _mm_shuffle_epi8(s, mask1[k %  504]); k /=  504;  //j = [6..8]
    s = _mm_shuffle_epi8(s, mask2[k       ]);             //j = [9..11]
    return s;
}
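
One possible way to generate the tables (a sketch of my own, with hypothetical helper names make_mask/init_masks, not the code from the full solution): replay the scalar swap loop on an identity index vector for one digit's range of j; the resulting byte indices are exactly the pshufb control mask for that digit value.

#include <cstdint>
#include <emmintrin.h>
#include <utility>

// Build the pshufb mask for one digit value: run the scalar swaps for
// j in [j_first, j_last] on an identity index array. Afterwards idx[i]
// holds the source position of the byte that ends up at position i,
// which is exactly what pshufb expects in its control vector.
static __m128i make_mask(int digit, int j_first, int j_last)
{
    uint8_t idx[16];
    for (int i = 0; i < 16; ++i)
        idx[i] = (uint8_t)i;                     // identity permutation

    for (int j = j_first; j <= j_last; ++j)
    {
        std::swap(idx[digit % (j + 1)], idx[j]);
        digit /= (j + 1);
    }
    return _mm_loadu_si128((const __m128i*)idx);
}

static void init_masks(__m128i *mask0, __m128i *mask1, __m128i *mask2)
{
    for (int d = 0; d <  720; ++d) mask0[d] = make_mask(d, 1, 5);   // j = 1..5
    for (int d = 0; d <  504; ++d) mask1[d] = make_mask(d, 6, 8);   // j = 6..8
    for (int d = 0; d < 1320; ++d) mask2[d] = make_mask(d, 9, 11);  // j = 9..11
}

Because the swap positions depend only on the digit value and on j (never on the data), the three masks compose correctly when applied one after another with _mm_shuffle_epi8.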

You can vary the decomposition into bases in order to balance the number of steps against the size of the lookup tables.

Note: the division operation is very slow. Use only divisions by compile-time constants, so that the optimizer can transform them into multiplications. Check the generated assembly to make sure no division instructions remain.
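
As a small illustration (the constants BASE0/BASE1/BASE2 match the comments used in the loop further below; the helper function is my own), it is enough to keep the bases as named compile-time constants:

constexpr int BASE0 = 720, BASE1 = 504, BASE2 = 1320;
static_assert(BASE0 * BASE1 * BASE2 == 479001600, "the bases must cover all 12! permutations");

// With a constant divisor, gcc/clang at -O2 typically emit a
// multiply-and-shift sequence here instead of a div/idiv instruction.
inline int digit0(int k) { return k % BASE0; }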

Many calls

Unfortunately, with the suggested solution the index calculations still eat most of the time (see the generated assembly). This overhead can be reduced significantly if you process several consecutive values of k at once.

The simplest way to optimize the solution is to iterate over the digits of k separately instead of running a single loop over k. Then the index calculations become unnecessary, and you can reuse partially computed results.

__m128i s;// = ???
for (int k0 = 0; k0 <  720; k0++) {
    __m128i s0 = _mm_shuffle_epi8(s, mask0[k0]);
    for (int k1 = 0; k1 <  504; k1++) {
        __m128i s1 = _mm_shuffle_epi8(s0, mask1[k1]);
        for (int k2 = 0; k2 < 1320; k2+=4) {
            //for k = (((k2+0) * BASE1) + k1) * BASE0 + k0:
            __m128i sx0 = _mm_shuffle_epi8(s1, mask2[k2+0]);
            //for k = (((k2+1) * BASE1) + k1) * BASE0 + k0:
            __m128i sx1 = _mm_shuffle_epi8(s1, mask2[k2+1]);
            //for k = (((k2+2) * BASE1) + k1) * BASE0 + k0:
            __m128i sx2 = _mm_shuffle_epi8(s1, mask2[k2+2]);
            //for k = (((k2+3) * BASE1) + k1) * BASE0 + k0:
            __m128i sx3 = _mm_shuffle_epi8(s1, mask2[k2+3]);

            // ... check four strings: sx0, sx1, sx2, sx3
        }
    }
}

This way you do about one shuffle per permutation on average (see the assembly): the shuffles in the two outer loops are amortized over the many iterations of the inner loop. This seems close to optimal.

Code and results

Here is the full working code of all solutions.

Note that generation of the lookup tables is not trivial to fully explain, and the corresponding code is rather large (and filled with nasty details).

The benchmark, run on an Intel Core 2 Duo E4700 (Allendale, 2600 MHz), gives the following results:

2.605 s: original code         (k < 12739451)
0.125 s: single-call fast code (k < 12739451)
4.822 s: single-call fast code (k < 479001600)
0.749 s: many-call fast code   (k < 479001600)

So the single-call version is about 20 times faster than the original code, and the many-call version is about 6.5 times faster than the single-call version.

answered Nov 02 '22 by stgatilov