I want to divide an AVX2 vector by a constant. I have read this question and many other pages, and I found something that might help (fixed-point arithmetic), but I didn't understand it. The division is the bottleneck of my code. I tried two ways:
First, casting to float and doing the operation with AVX instructions:
//outside the bottleneck:
__m256i veci16; // containing some integer numbers (16x16-bit numbers)
__m256 div_v = _mm256_set1_ps(div);
//inside the bottleneck
//some calculations which make veci16
vecps = _mm256_castsi256_ps (veci16);
vecps = _mm256_div_ps (vecps, div_v);
veci16 = _mm256_castps_si256 (vecps);
_mm256_storeu_si256((__m256i *)&output[i][j], veci16);
The problem with the first method: without the division the elapsed time is 5 ns, and with it the elapsed time is about 60 ns.
Second, I stored to an array and loaded it like this:
int t[16] ;
inline __m256i _mm256_div_epi16 (__m256i a , int b){
    _mm256_store_si256((__m256i *)&t[0] , a);
    t[0]/=b; t[1]/=b; t[2]/=b; t[3]/=b; t[4]/=b; t[5]/=b; t[6]/=b; t[7]/=b;
    t[8]/=b; t[9]/=b; t[10]/=b; t[11]/=b; t[12]/=b; t[13]/=b; t[14]/=b; t[15]/=b;
    return _mm256_load_si256((__m256i *)&t[0]);         
}
Well, it was better, but the elapsed time is still 17 ns. The rest of the calculations are too long to show here.
The question is: is there a faster way to implement this inline function?
You can do this with _mm256_mulhrs_epi16. This does a fixed-point multiply, so you just set the multiplicand vector to 32768 / b:
inline __m256i _mm256_div_epi16 (const __m256i va, const int b)
{
    __m256i vb = _mm256_set1_epi16(32768 / b);
    return _mm256_mulhrs_epi16(va, vb);
}
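Note that this assumes b > 1, since 32768 / 1 does not fit in a signed 16-bit lane. The result is also rounded rather than truncated, and because 32768 / b is itself truncated, the result can differ slightly from exact integer division when b is not a power of two.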
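To see what the fixed-point multiply does in each 16-bit lane, here is a minimal scalar sketch in plain C. It models the per-lane arithmetic of _mm256_mulhrs_epi16 as documented by Intel, (((a * b) >> 14) + 1) >> 1. The names mulhrs_lane, div_lane and check_examples are hypothetical helpers for illustration, not part of any library, and the sketch assumes 1 < b <= 32767 and a compiler that uses an arithmetic right shift for negative values (as GCC and Clang do).

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of _mm256_mulhrs_epi16:
   a Q15 fixed-point multiply, rounded to nearest.
   Assumes >> on a negative int32_t is an arithmetic shift. */
int16_t mulhrs_lane(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;     /* full 32-bit product */
    return (int16_t)(((prod >> 14) + 1) >> 1);  /* round, keep high bits */
}

/* Hypothetical helper: one lane of the divide-by-constant trick.
   Assumes 1 < b <= 32767 so that 32768 / b fits in int16_t. */
int16_t div_lane(int16_t a, int b)
{
    int16_t recip = (int16_t)(32768 / b);       /* approximates 1/b in Q15 */
    return mulhrs_lane(a, recip);
}

/* A few hand-checked cases showing where the trick matches
   truncating division and where it does not. */
void check_examples(void)
{
    assert(div_lane(100, 4) == 25);      /* exact: b is a power of two */
    assert(div_lane(7, 2) == 4);         /* rounded up; 7 / 2 in C is 3 */
    assert(div_lane(30000, 3) == 9999);  /* off by one: 32768 / 3 was truncated */
}
```

Calling check_examples() from a main function exercises the three cases. If exact truncating division is required for every input, this approximation is not sufficient on its own and the results would need a correction step or a different method.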