Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a good way of finding modulus of two variables using SSE? (without SVML)

Tags:

c++

sse

I'm trying to learn to use SSE, one of the programs I was making required the use of modulus division and so I wrote this to do it (sorry it's overcommented):

__m128i SSEModDiv(__m128i input, __m128i divisors)
{
    //Error Checking (div by zero)
    /*__m128i zeros = _mm_set1_epi32(0);
    __m128i error = _mm_set1_epi32(-1);
    __m128i zerocheck = _mm_cmpeq_epi32(zeros, divisors);
    if (_mm_extract_epi16(zerocheck, 0) != 0)
        return error;
    if (_mm_extract_epi16(zerocheck, 2) != 0)
        return error;
    if (_mm_extract_epi16(zerocheck, 4) != 0)
        return error;
    if (_mm_extract_epi16(zerocheck, 6) != 0)
        return error;*/

    //Now for the real work
    __m128 inputf = _mm_cvtepi32_ps(input);
    __m128 divisorsf = _mm_cvtepi32_ps(divisors);

    /*__m128 recip = _mm_rcp_ps(divisorsf); //Takes reciprocal
    __m128 divided = _mm_mul_ps(inputf, recip); //multiplies by reciprical values*/
    __m128 divided = _mm_div_ps(inputf, divisorsf);
    __m128i intermediateint = _mm_cvttps_epi32(divided); //makes an integer version truncated
    __m128 intermediate = _mm_cvtepi32_ps(intermediateint);
    __m128 multiplied = _mm_mul_ps(intermediate, divisorsf); //multiplies the intermediate with the divisors
    __m128 mods = _mm_sub_ps(inputf, multiplied); //subtracts to get moduli
    return _mm_cvtps_epi32(mods);
}

The problem is this is about as fast as taking the modulus of each element of the four 32-bit integers individually in release and about 10 times slower in debug (found by profiling).

Could anyone give me any pointers on how to make this function faster?

-I can't use SVML be cause I'm using Visual Studio-

like image 707
Kelechi Nze Avatar asked Jul 14 '14 19:07

Kelechi Nze


1 Answers

For general values of of input and divisors there are no useful SIMD x86 instructions for integer division or modulus so it's best to use scalar integer division. However, there are special cases where SIMD integer modulus can be done faster.

For example if you want to do (a+b)%c and a and b are already reduced (i.e. a<c and b<c) then you can use a comparison and a subtraction like this:

z = a + b
if(z>=c) z-=c;

I did this in this example vectorizing-modular-arithmetic

Another example is if the divisors are not compile time constants but are nevertheless constant within a loop then you can use an analogous idea from floating point division. A common trick in floating point division is to pre-compute the reciprocal of the divisor and do multiplication like this:

float fact = 1.0/x;
for(int i=0; i<n; i++) {
    z[i] = fact*y[i];  //z[i] = y[i]/x;
}

You can use a similar technique for integer division which converts the integer division into integer multiplication with a shift.

y / x ≈ y * (2 n / x) >> n

There are several different techniques to determine the factor (aka magic number) (2 n / x) and the shift n. In fact most compilers will already do this for compile time constants and division. If you try for example x/7 and look at the assembly output from GCC or MSVC you will see that they don't actually do integer division, they do multiplication and a shift using the same magic numbers and shifts as calculated from http://www.hackersdelight.org/magic.htm.

I do this at runtime using Agner Fog's Vector Class Library or his Subroutine library. Both libraries will calculate the magic numbers and shifts for you at runtime for SSE and AVX integer division.

But as I said at the start of this answer if you want to do

for(int i=0; i<n; i++) {
    z[i] = y[i]%x[i];
}

and x[i] is not a constant within the loop it's best to stick to scalar division/modulus.

like image 113
Z boson Avatar answered Sep 29 '22 16:09

Z boson