I'm trying to learn to use SSE, one of the programs I was making required the use of modulus division and so I wrote this to do it (sorry it's overcommented):
__m128i SSEModDiv(__m128i input, __m128i divisors)
{
//Error Checking (div by zero)
/*__m128i zeros = _mm_set1_epi32(0);
__m128i error = _mm_set1_epi32(-1);
__m128i zerocheck = _mm_cmpeq_epi32(zeros, divisors);
if (_mm_extract_epi16(zerocheck, 0) != 0)
return error;
if (_mm_extract_epi16(zerocheck, 2) != 0)
return error;
if (_mm_extract_epi16(zerocheck, 4) != 0)
return error;
if (_mm_extract_epi16(zerocheck, 6) != 0)
return error;*/
//Now for the real work
__m128 inputf = _mm_cvtepi32_ps(input);
__m128 divisorsf = _mm_cvtepi32_ps(divisors);
/*__m128 recip = _mm_rcp_ps(divisorsf); //Takes reciprocal
__m128 divided = _mm_mul_ps(inputf, recip); //multiplies by reciprical values*/
__m128 divided = _mm_div_ps(inputf, divisorsf);
__m128i intermediateint = _mm_cvttps_epi32(divided); //makes an integer version truncated
__m128 intermediate = _mm_cvtepi32_ps(intermediateint);
__m128 multiplied = _mm_mul_ps(intermediate, divisorsf); //multiplies the intermediate with the divisors
__m128 mods = _mm_sub_ps(inputf, multiplied); //subtracts to get moduli
return _mm_cvtps_epi32(mods);
}
The problem is this is about as fast as taking the modulus of each element of the four 32-bit integers individually in release and about 10 times slower in debug (found by profiling).
Could anyone give me any pointers on how to make this function faster?
-I can't use SVML be cause I'm using Visual Studio-
For general values of of input
and divisors
there are no useful SIMD x86 instructions for integer division or modulus so it's best to use scalar integer division. However, there are special cases where SIMD integer modulus can be done faster.
For example if you want to do (a+b)%c and a and b are already reduced (i.e. a<c
and b<c
) then you can use a comparison and a subtraction like this:
z = a + b
if(z>=c) z-=c;
I did this in this example vectorizing-modular-arithmetic
Another example is if the divisors are not compile time constants but are nevertheless constant within a loop then you can use an analogous idea from floating point division. A common trick in floating point division is to pre-compute the reciprocal of the divisor and do multiplication like this:
float fact = 1.0/x;
for(int i=0; i<n; i++) {
z[i] = fact*y[i]; //z[i] = y[i]/x;
}
You can use a similar technique for integer division which converts the integer division into integer multiplication with a shift.
y / x ≈ y * (2 n / x) >> n
There are several different techniques to determine the factor (aka magic number) (2 n / x)
and the shift n
. In fact most compilers will already do this for compile time constants and division. If you try for example x/7
and look at the assembly output from GCC or MSVC you will see that they don't actually do integer division, they do multiplication and a shift using the same magic numbers and shifts as calculated from http://www.hackersdelight.org/magic.htm.
I do this at runtime using Agner Fog's Vector Class Library or his Subroutine library. Both libraries will calculate the magic numbers and shifts for you at runtime for SSE and AVX integer division.
But as I said at the start of this answer if you want to do
for(int i=0; i<n; i++) {
z[i] = y[i]%x[i];
}
and x[i]
is not a constant within the loop it's best to stick to scalar division/modulus.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With