There is _mm_div_ps for floating-point division and _mm_mullo_epi16 for integer multiplication. But is there something for integer division (16-bit values)? How can I perform such a division?
Agner Fog's method (http://www.agner.org/optimize/#vectorclass) works great if the division uses a single invariant divisor. It brings even further benefits if the divisor is known at compile time, or if it does not change often at runtime.
However, when performing SIMD division on __m128i elements where no information is available about either the divisor or the dividend at compile time, we have no option but to convert to float and perform the computation there. On the other hand, using _mm_div_ps will not provide amazing speed improvements: this instruction has a variable latency of 11 to 14 cycles on most micro-architectures, and can go up to 38 cycles on Knights Landing. On top of that, it is not fully pipelined, with a reciprocal throughput of 3-6 cycles depending on the micro-architecture. We can, however, avoid _mm_div_ps and use _mm_rcp_ss instead.
Unfortunately, __m128 _mm_rcp_ss (__m128 a) is fast only because it provides an approximation. Namely (taken from the Intel Intrinsics Guide):
Compute the approximate reciprocal of the lower single-precision (32-bit) floating-point element in a, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst. The maximum relative error for this approximation is less than 1.5*2^-12.
Therefore, to benefit from _mm_rcp_ss, we need to compensate for the loss due to the approximation. There is great work done in this direction, available in "Improved division by invariant integers" by Niels Möller and Torbjörn Granlund:
Due to lack of efficient division instructions in current processors, the division is performed as a multiplication using a precomputed single-word approximation of the reciprocal of the divisor, followed by a couple of adjustment steps.
To calculate 16-bit signed integer division, we only need one adjustment step, and we base our solution accordingly.
static inline __m128i _mm_div_epi16(const __m128i &a_epi16, const __m128i &b_epi16) {
    //
    // Setup the constants.
    //
    // "two" is 2.0 plus a small epsilon, which compensates for single-precision
    // rounding so that the final truncation yields the exact quotient.
    const __m128 two = _mm_set1_ps(2.00000051757f);
    const __m128i lo_mask = _mm_set1_epi32(0xFFFF);
    //
    // Sign-extend the 16-bit halves of each lane into two 32-bit integers
    //
    const __m128i a_hi_epi32 = _mm_srai_epi32(a_epi16, 16);
    const __m128i a_lo_epi32_shift = _mm_slli_epi32(a_epi16, 16);
    const __m128i a_lo_epi32 = _mm_srai_epi32(a_lo_epi32_shift, 16);
    const __m128i b_hi_epi32 = _mm_srai_epi32(b_epi16, 16);
    const __m128i b_lo_epi32_shift = _mm_slli_epi32(b_epi16, 16);
    const __m128i b_lo_epi32 = _mm_srai_epi32(b_lo_epi32_shift, 16);
    //
    // Convert to 32-bit floats
    //
    const __m128 a_hi = _mm_cvtepi32_ps(a_hi_epi32);
    const __m128 a_lo = _mm_cvtepi32_ps(a_lo_epi32);
    const __m128 b_hi = _mm_cvtepi32_ps(b_hi_epi32);
    const __m128 b_lo = _mm_cvtepi32_ps(b_lo_epi32);
    //
    // Calculate the approximate reciprocal
    //
    const __m128 b_hi_rcp = _mm_rcp_ps(b_hi);
    const __m128 b_lo_rcp = _mm_rcp_ps(b_lo);
    //
    // Calculate the inverse (one Newton-Raphson adjustment step)
    //
#ifdef __FMA__
    const __m128 b_hi_inv_1 = _mm_fnmadd_ps(b_hi_rcp, b_hi, two);
    const __m128 b_lo_inv_1 = _mm_fnmadd_ps(b_lo_rcp, b_lo, two);
#else
    const __m128 b_mul_hi = _mm_mul_ps(b_hi_rcp, b_hi);
    const __m128 b_mul_lo = _mm_mul_ps(b_lo_rcp, b_lo);
    const __m128 b_hi_inv_1 = _mm_sub_ps(two, b_mul_hi);
    const __m128 b_lo_inv_1 = _mm_sub_ps(two, b_mul_lo);
#endif
    //
    // Compensate for the loss of the approximation
    //
    const __m128 b_hi_rcp_1 = _mm_mul_ps(b_hi_rcp, b_hi_inv_1);
    const __m128 b_lo_rcp_1 = _mm_mul_ps(b_lo_rcp, b_lo_inv_1);
    //
    // Perform the division by multiplication
    //
    const __m128 hi = _mm_mul_ps(a_hi, b_hi_rcp_1);
    const __m128 lo = _mm_mul_ps(a_lo, b_lo_rcp_1);
    //
    // Convert back to integers, truncating toward zero
    //
    const __m128i hi_epi32 = _mm_cvttps_epi32(hi);
    const __m128i lo_epi32 = _mm_cvttps_epi32(lo);
    //
    // Move the high-half results into the upper 16 bits of each lane
    //
    const __m128i hi_epi32_shift = _mm_slli_epi32(hi_epi32, 16);
    //
    // Blend the bits, and return
    //
#ifdef __SSE4_1__
    return _mm_blend_epi16(lo_epi32, hi_epi32_shift, 0xAA);
#else
    const __m128i lo_epi32_mask = _mm_and_si128(lo_epi32, lo_mask);
    return _mm_or_si128(hi_epi32_shift, lo_epi32_mask);
#endif
}
This solution works with SSE2 only, and will make use of FMA if available. However, it may be that plain division is just as fast as (or even faster than) the approximation.
In the presence of AVX, this solution can be improved, as the high and low parts can be processed at the same time in a single AVX register.
Validation
As we are dealing with 16 bits only, we can easily validate the correctness of the solution in a few seconds using brute-force testing:
#include <immintrin.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <iostream>
using std::cout;
using std::endl;
void print_epi16(__m128i a)
{
    int16_t tmp[8];
    _mm_storeu_si128((__m128i *) tmp, a);
    for (int i = 0; i < 8; i += 1) {
        printf("%8d ", (int) tmp[i]);
    }
    printf("\n");
}
bool run_mm_div_epi16(const int16_t *a, const int16_t *b)
{
    const size_t n = 8;
    int16_t result_expected[n];
    int16_t result_obtained[n];
    //
    // Derive the expected result
    //
    for (size_t i = 0; i < n; i += 1) {
        result_expected[i] = a[i] / b[i];
    }
    //
    // Now perform the computation
    //
    const __m128i va = _mm_loadu_si128((const __m128i *) a);
    const __m128i vb = _mm_loadu_si128((const __m128i *) b);
    const __m128i vr = _mm_div_epi16(va, vb);
    _mm_storeu_si128((__m128i *) result_obtained, vr);
    //
    // Check for array equality
    //
    const bool eq = std::equal(result_obtained, result_obtained + n, result_expected);
    if (!eq) {
        cout << "Testing of _mm_div_epi16 failed" << endl << endl;
        cout << "a: ";
        print_epi16(va);
        cout << "b: ";
        print_epi16(vb);
        cout << endl;
        cout << "results_obtained: ";
        print_epi16(vr);
        cout << "results_expected: ";
        print_epi16(_mm_loadu_si128((const __m128i *) result_expected));
        cout << endl;
    }
    return eq;
}
void test_mm_div_epi16()
{
    const int n = 8;
    bool correct = true;
    //
    // Brute-force testing: every dividend against every non-zero divisor
    //
    int16_t a[n];
    int16_t b[n];
    for (int32_t i = INT16_MIN; correct && i <= INT16_MAX; i += n) {
        for (int32_t j = 0; j < n; j += 1) {
            a[j] = (int16_t) (i + j);
        }
        for (int32_t j = INT16_MIN; correct && j < 0; j += 1) {
            const __m128i jv = _mm_set1_epi16((int16_t) j);
            _mm_storeu_si128((__m128i *) b, jv);
            correct = correct && run_mm_div_epi16(a, b);
        }
        for (int32_t j = 1; correct && j <= INT16_MAX; j += 1) {
            const __m128i jv = _mm_set1_epi16((int16_t) j);
            _mm_storeu_si128((__m128i *) b, jv);
            correct = correct && run_mm_div_epi16(a, b);
        }
    }
    if (correct) {
        cout << "Done!" << endl;
    } else {
        cout << "_mm_div_epi16 could not be validated" << endl;
    }
}
Having the solution above, an AVX2 implementation is straightforward:
static inline __m256i _mm256_div_epi16(const __m256i &a_epi16, const __m256i &b_epi16) {
    //
    // Setup the constants.
    //
    const __m256 two = _mm256_set1_ps(2.00000051757f);
    //
    // Sign-extend the 16-bit halves of each lane into two 32-bit integers
    //
    const __m256i a_hi_epi32 = _mm256_srai_epi32(a_epi16, 16);
    const __m256i a_lo_epi32_shift = _mm256_slli_epi32(a_epi16, 16);
    const __m256i a_lo_epi32 = _mm256_srai_epi32(a_lo_epi32_shift, 16);
    const __m256i b_hi_epi32 = _mm256_srai_epi32(b_epi16, 16);
    const __m256i b_lo_epi32_shift = _mm256_slli_epi32(b_epi16, 16);
    const __m256i b_lo_epi32 = _mm256_srai_epi32(b_lo_epi32_shift, 16);
    //
    // Convert to 32-bit floats
    //
    const __m256 a_hi = _mm256_cvtepi32_ps(a_hi_epi32);
    const __m256 a_lo = _mm256_cvtepi32_ps(a_lo_epi32);
    const __m256 b_hi = _mm256_cvtepi32_ps(b_hi_epi32);
    const __m256 b_lo = _mm256_cvtepi32_ps(b_lo_epi32);
    //
    // Calculate the approximate reciprocal
    //
    const __m256 b_hi_rcp = _mm256_rcp_ps(b_hi);
    const __m256 b_lo_rcp = _mm256_rcp_ps(b_lo);
    //
    // Calculate the inverse (one Newton-Raphson adjustment step)
    //
    const __m256 b_hi_inv_1 = _mm256_fnmadd_ps(b_hi_rcp, b_hi, two);
    const __m256 b_lo_inv_1 = _mm256_fnmadd_ps(b_lo_rcp, b_lo, two);
    //
    // Compensate for the loss of the approximation
    //
    const __m256 b_hi_rcp_1 = _mm256_mul_ps(b_hi_rcp, b_hi_inv_1);
    const __m256 b_lo_rcp_1 = _mm256_mul_ps(b_lo_rcp, b_lo_inv_1);
    //
    // Perform the division by multiplication
    //
    const __m256 hi = _mm256_mul_ps(a_hi, b_hi_rcp_1);
    const __m256 lo = _mm256_mul_ps(a_lo, b_lo_rcp_1);
    //
    // Convert back to integers, truncating toward zero
    //
    const __m256i hi_epi32 = _mm256_cvttps_epi32(hi);
    const __m256i lo_epi32 = _mm256_cvttps_epi32(lo);
    //
    // Blend the low and the high parts
    //
    const __m256i hi_epi32_shift = _mm256_slli_epi32(hi_epi32, 16);
    return _mm256_blend_epi16(lo_epi32, hi_epi32_shift, 0xAA);
}
We can use the same method described above to perform validation of the code.
We can evaluate the performance using flops per cycle (F/C) as the measure. In this scenario, we want to see how many divisions we can perform per cycle. For that purpose, we define two vectors a and b and perform point-wise division. Both a and b are populated with random data using xorshift32, initialized with uint32_t state = 3853970173;
I use RDTSC to measure the cycles, performing 15 repetitions with a warm cache, and taking the median as the result. To avoid the effects of frequency scaling and resource sharing on the measurements, Turbo Boost and Hyper-Threading are disabled. The test machine is an Intel Xeon CPU E3-1285L v3 3.10GHz Haswell with 32GB of RAM and 25.6 GB/s of bandwidth to main memory, running Debian GNU/Linux 8 (jessie), kernel 3.16.43-2+deb8u3. The gcc used is 4.9.2-10. Results are given below:
Pure SSE2 Implementation
We compare plain division against the algorithm proposed above:
===============================================================
= Compiler & System info
===============================================================
Current CPU : Intel(R) Xeon(R) CPU E3-1285L v3 @ 3.10GHz
CXX Compiler ID : GNU
CXX Compiler Path : /usr/bin/c++
CXX Compiler Version : 4.9.2
CXX Compiler Flags : -O3 -std=c++11 -msse2 -mno-fma
--------------------------------------------------------------------------------
| Size | Division F/C | Division B/W | Approx. F/C | Approximation B/W |
--------------------------------------------------------------------------------
| 128 | 0.5714286 | 26911.45 MB/s | 0.5019608 | 23634.21 MB/s |
| 256 | 0.5714286 | 26909.17 MB/s | 0.5039370 | 23745.44 MB/s |
| 512 | 0.5707915 | 26928.14 MB/s | 0.5039370 | 23763.79 MB/s |
| 1024 | 0.5707915 | 26936.33 MB/s | 0.5039370 | 23776.85 MB/s |
| 2048 | 0.5709507 | 26938.51 MB/s | 0.5039370 | 23780.25 MB/s |
| 4096 | 0.5708711 | 26940.56 MB/s | 0.5039990 | 23782.65 MB/s |
| 8192 | 0.5708711 | 26940.16 MB/s | 0.5039370 | 23781.85 MB/s |
| 16384 | 0.5704735 | 26921.76 MB/s | 0.4954040 | 23379.24 MB/s |
| 32768 | 0.5704537 | 26921.26 MB/s | 0.4954639 | 23382.13 MB/s |
| 65536 | 0.5703147 | 26914.53 MB/s | 0.4943539 | 23330.13 MB/s |
| 131072 | 0.5691680 | 26860.21 MB/s | 0.4929539 | 23264.40 MB/s |
| 262144 | 0.5690618 | 26855.60 MB/s | 0.4929187 | 23262.22 MB/s |
| 524288 | 0.5691378 | 26858.75 MB/s | 0.4929488 | 23263.56 MB/s |
| 1048576 | 0.5677474 | 26794.14 MB/s | 0.4918968 | 23214.34 MB/s |
| 2097152 | 0.5371243 | 25348.39 MB/s | 0.4700511 | 22183.07 MB/s |
| 4194304 | 0.5128146 | 24200.82 MB/s | 0.4529809 | 21377.28 MB/s |
| 8388608 | 0.5036971 | 23770.36 MB/s | 0.4438345 | 20945.84 MB/s |
| 16777216 | 0.5005390 | 23621.14 MB/s | 0.4409909 | 20811.32 MB/s |
| 33554432 | 0.4992792 | 23561.90 MB/s | 0.4399777 | 20763.49 MB/s |
--------------------------------------------------------------------------------
We can observe that plain division is slightly faster than the proposed approximation. In this scenario, we can conclude that using the SSE2 approximation is suboptimal on a Haswell micro-architecture. However, if we run the same benchmark on an older Westmere machine, such as an Intel(R) Xeon(R) CPU X5680 @ 3.33GHz, we can already see the benefit of the approximation:
===============================================================
= Compiler & System info
===============================================================
Current CPU : Intel(R) Xeon(R) CPU X5680 @ 3.33GHz
CXX Compiler ID : GNU
CXX Compiler Path : /usr/bin/c++
CXX Compiler Version : 4.8.5
CXX Compiler Flags : -O3 -std=c++11 -msse2 -mno-fma
--------------------------------------------------------------------------------
| Size | Division F/C | Division B/W | Approx. F/C | Approximation B/W |
--------------------------------------------------------------------------------
| 128 | 0.2857143 | 14511.41 MB/s | 0.3720930 | 18899.89 MB/s |
| 256 | 0.2853958 | 14512.51 MB/s | 0.3715530 | 18898.91 MB/s |
| 512 | 0.2853958 | 14510.53 MB/s | 0.3715530 | 18896.44 MB/s |
| 1024 | 0.2853162 | 14511.81 MB/s | 0.3700759 | 18824.00 MB/s |
| 2048 | 0.2853162 | 14511.04 MB/s | 0.3708130 | 18860.31 MB/s |
| 4096 | 0.2852964 | 14511.16 MB/s | 0.3711826 | 18879.27 MB/s |
| 8192 | 0.2852666 | 14510.23 MB/s | 0.3713172 | 18886.39 MB/s |
| 16384 | 0.2852616 | 14509.86 MB/s | 0.3712920 | 18885.60 MB/s |
| 32768 | 0.2852244 | 14507.93 MB/s | 0.3712709 | 18884.86 MB/s |
| 65536 | 0.2851003 | 14501.41 MB/s | 0.3701114 | 18826.14 MB/s |
| 131072 | 0.2850711 | 14499.95 MB/s | 0.3685017 | 18743.58 MB/s |
| 262144 | 0.2850745 | 14500.47 MB/s | 0.3684799 | 18742.78 MB/s |
| 524288 | 0.2848062 | 14486.66 MB/s | 0.3681040 | 18723.63 MB/s |
| 1048576 | 0.2846679 | 14479.64 MB/s | 0.3671284 | 18674.02 MB/s |
| 2097152 | 0.2840133 | 14446.52 MB/s | 0.3664623 | 18640.01 MB/s |
| 4194304 | 0.2745241 | 13963.13 MB/s | 0.3488823 | 17745.24 MB/s |
| 8388608 | 0.2741900 | 13946.39 MB/s | 0.3476036 | 17680.37 MB/s |
| 16777216 | 0.2740689 | 13940.32 MB/s | 0.3477076 | 17685.97 MB/s |
| 33554432 | 0.2746752 | 13970.75 MB/s | 0.3482017 | 17711.36 MB/s |
--------------------------------------------------------------------------------
It would be even nicer to see how this behaves on even older machines, say Nehalem (assuming it has RCP support).
SSE4.1 + FMA Implementation
We compare plain division against the algorithm proposed above, enabling FMA and SSE4.1:
===============================================================
= Compiler & System info
===============================================================
Current CPU : Intel(R) Xeon(R) CPU E3-1285L v3 @ 3.10GHz
CXX Compiler ID : GNU
CXX Compiler Path : /usr/bin/c++
CXX Compiler Version : 4.9.2
CXX Compiler Flags : -O3 -std=c++11 -msse4.1 -mfma
--------------------------------------------------------------------------------
| Size | Division F/C | Division B/W | Approx. F/C | Approximation B/W |
--------------------------------------------------------------------------------
| 128 | 0.5714286 | 26884.20 MB/s | 0.5423729 | 25506.41 MB/s |
| 256 | 0.5701559 | 26879.92 MB/s | 0.5412262 | 25503.95 MB/s |
| 512 | 0.5701559 | 26904.68 MB/s | 0.5423729 | 25584.65 MB/s |
| 1024 | 0.5704735 | 26911.46 MB/s | 0.5429480 | 25622.57 MB/s |
| 2048 | 0.5704735 | 26915.03 MB/s | 0.5433802 | 25640.09 MB/s |
| 4096 | 0.5703941 | 26917.72 MB/s | 0.5435965 | 25651.63 MB/s |
| 8192 | 0.5703544 | 26915.85 MB/s | 0.5436687 | 25656.76 MB/s |
| 16384 | 0.5699972 | 26898.44 MB/s | 0.5262583 | 24834.54 MB/s |
| 32768 | 0.5699873 | 26898.93 MB/s | 0.5262076 | 24833.21 MB/s |
| 65536 | 0.5698882 | 26894.48 MB/s | 0.5250567 | 24778.35 MB/s |
| 131072 | 0.5697024 | 26885.50 MB/s | 0.5224302 | 24654.59 MB/s |
| 262144 | 0.5696950 | 26884.72 MB/s | 0.5223095 | 24649.49 MB/s |
| 524288 | 0.5696937 | 26885.37 MB/s | 0.5223308 | 24650.21 MB/s |
| 1048576 | 0.5690340 | 26854.14 MB/s | 0.5220133 | 24634.71 MB/s |
| 2097152 | 0.5455717 | 25746.56 MB/s | 0.5041949 | 23794.65 MB/s |
| 4194304 | 0.5125461 | 24188.11 MB/s | 0.4756604 | 22447.05 MB/s |
| 8388608 | 0.5043430 | 23800.67 MB/s | 0.4659974 | 21991.51 MB/s |
| 16777216 | 0.5017375 | 23677.94 MB/s | 0.4614457 | 21776.58 MB/s |
| 33554432 | 0.5005865 | 23623.50 MB/s | 0.4596277 | 21690.63 MB/s |
--------------------------------------------------------------------------------
FMA + SSE4.1 does give us some level of improvement, but it is not good enough.
AVX2 + FMA Implementation
Finally, we see a real benefit when comparing AVX2 plain division against the approximation method:
===============================================================
= Compiler & System info
===============================================================
Current CPU : Intel(R) Xeon(R) CPU E3-1285L v3 @ 3.10GHz
CXX Compiler ID : GNU
CXX Compiler Path : /usr/bin/c++
CXX Compiler Version : 4.9.2
CXX Compiler Flags : -O3 -std=c++11 -march=haswell
--------------------------------------------------------------------------------
| Size | Division F/C | Division B/W | Approx. F/C | Approximation B/W |
--------------------------------------------------------------------------------
| 128 | 0.5663717 | 26672.73 MB/s | 0.9481481 | 44627.89 MB/s |
| 256 | 0.5651214 | 26653.72 MB/s | 0.9481481 | 44651.56 MB/s |
| 512 | 0.5644983 | 26640.36 MB/s | 0.9463956 | 44660.99 MB/s |
| 1024 | 0.5657459 | 26689.41 MB/s | 0.9552239 | 45044.21 MB/s |
| 2048 | 0.5662151 | 26715.40 MB/s | 0.9624060 | 45405.33 MB/s |
| 4096 | 0.5663717 | 26726.27 MB/s | 0.9671783 | 45633.64 MB/s |
| 8192 | 0.5664500 | 26732.42 MB/s | 0.9688941 | 45724.83 MB/s |
| 16384 | 0.5699377 | 26896.04 MB/s | 0.9092624 | 42909.11 MB/s |
| 32768 | 0.5699675 | 26897.85 MB/s | 0.9087077 | 42883.21 MB/s |
| 65536 | 0.5699625 | 26898.59 MB/s | 0.9001456 | 42480.91 MB/s |
| 131072 | 0.5699253 | 26896.38 MB/s | 0.8926057 | 42124.09 MB/s |
| 262144 | 0.5699117 | 26895.58 MB/s | 0.8928610 | 42137.13 MB/s |
| 524288 | 0.5698622 | 26892.87 MB/s | 0.8928002 | 42133.63 MB/s |
| 1048576 | 0.5685829 | 26833.13 MB/s | 0.8894302 | 41974.25 MB/s |
| 2097152 | 0.5558453 | 26231.90 MB/s | 0.8371921 | 39508.55 MB/s |
| 4194304 | 0.5224387 | 24654.67 MB/s | 0.7436747 | 35094.81 MB/s |
| 8388608 | 0.5143588 | 24273.46 MB/s | 0.7185252 | 33909.08 MB/s |
| 16777216 | 0.5107452 | 24103.19 MB/s | 0.7133449 | 33664.28 MB/s |
| 33554432 | 0.5101245 | 24074.10 MB/s | 0.7125114 | 33625.03 MB/s |
--------------------------------------------------------------------------------
This method can definitely provide a speed-up over plain division. How much speed-up can actually be attained really depends on the underlying architecture, as well as on how the division interacts with the rest of the application logic.