I have recently discovered that AVX2 doesn't have a popcount for __m256i, and the only way I found to do something similar is to follow Wojciech Muła's algorithm:
#include <immintrin.h>

__m256i count(__m256i v) {
    __m256i lookup = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2,
                                      2, 3, 2, 3, 3, 4, 0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);
    __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi32(v, 4), low_mask);
    __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo);
    __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi);
    __m256i total = _mm256_add_epi8(popcnt1, popcnt2);
    return _mm256_sad_epu8(total, _mm256_setzero_si256());
}
Wojciech Muła, Nathan Kurz, Daniel Lemire, Faster Population Counts Using AVX2 Instructions, Computer Journal 61 (1), 2018
The problem is that it returns the counts summed into four 64-bit lanes instead of eight 32-bit lanes.
What's currently happening:
I have a __m256i x containing these eight 32-bit ints:
__m256i res = count(x);
res contains:
The result is four 64-bit longs.
Expectation:
I have a __m256i x containing these eight 32-bit ints:
__m256i res = count(x);
res contains:
The result is eight 32-bit ints.
I hope that was clear; don't hesitate to ask me for more details.
Thanks.
AVX-512 VPOPCNTDQ has _mm256_popcnt_epi32 to popcount in 32-bit chunks, plus a 64-bit chunk-size version (_mm256_popcnt_epi64). Outside of Xeon Phi, it's new in Ice Lake, which also introduced AVX-512 BITALG with byte and word (16-bit) chunk sizes of vpopcnt.
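If you can require those extensions (the 256-bit form also needs AVX-512VL), the whole problem becomes a single instruction. A minimal sketch, assuming an Ice Lake-or-later target; the function name is my own:

#include <immintrin.h>

// Requires AVX-512 VPOPCNTDQ + AVX-512VL,
// e.g. gcc/clang -march=icelake-client
__m256i popcount_avx512_32(__m256i v) {
    return _mm256_popcnt_epi32(v);  // per-32-bit-element popcount
}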
The original code you are quoting relies on the _mm256_sad_epu8 intrinsic, which specifically sums up bytes within 64-bit words.
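That is why your res comes back as four 64-bit sums. A quick way to see it, reusing count() from the question (the sample input and harness are mine):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m256i x = _mm256_set1_epi32(7);   // eight ints with 3 set bits each
    uint64_t counts[4];
    _mm256_storeu_si256((__m256i *)counts, count(x));
    for (int i = 0; i < 4; i++)         // prints 6 four times:
        printf("%llu\n", (unsigned long long)counts[i]);  // 2 ints per 64-bit lane
}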
To get the same result, with sums of 32-bit words, you need to do something slightly different. The following should work:
__m256i popcount_pshufb32(__m256i v) {
    // 16-entry LUT: popcount of each possible nibble, repeated per 128-bit lane
    __m256i lookup = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2,
                                      2, 3, 2, 3, 3, 4, 0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);
    __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
    __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo);
    __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi);
    __m256i sum8 = _mm256_add_epi8(popcnt1, popcnt2);  // per-byte popcounts
    // Multiplying by 0x01010101 sums the 4 bytes of each 32-bit lane
    // into that lane's top byte; the shift brings the sum back down.
    return _mm256_srli_epi32(
        _mm256_mullo_epi32(sum8, _mm256_set1_epi32(0x01010101)), 24);
    // vpmulld is slowish (2 uops) on most recent Intel CPUs
    // but still single-uop on AMD
}
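To sanity-check the lane layout, a throwaway harness along these lines (the test values are mine; GCC/Clang, compile with -mavx2) should print eight matching 32-bit counts:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t in[8] = {0, 1, 3, 0xFFu, 0xFFFFu,
                      0x80000000u, 0xFFFFFFFFu, 0x12345678u};
    uint32_t out[8];
    _mm256_storeu_si256((__m256i *)out,
                        popcount_pshufb32(_mm256_loadu_si256((const __m256i *)in)));
    for (int i = 0; i < 8; i++)
        printf("%u (expect %d)\n", out[i], __builtin_popcount(in[i]));
}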
So we replaced _mm256_sad_epu8 with a multiplication and a shift. That should be reasonable. In my tests, it is slightly slower than the original 64-bit version, but the difference is relatively small.
You can get slightly better performance on Intel, at the cost of one more vector constant, by using two different instructions to accumulate from bytes to 32-bit chunks. (AMD Zen 1/2/3 is at least as efficient with the version above as with the one below.) 32-bit SIMD-integer multiply (vpmulld) is 2 uops on recent Intel CPUs (both for the SIMD-integer-multiply units), but the pairwise multiply-accumulate instructions (8->16 and 16->32 bit) are a single uop each (https://uops.info/). This approach needs one more constant but the same number of instructions, for fewer uops, especially if the compiler can reuse the constants in a loop.
__m256i popcount_pshufb32_madd(__m256i v) {  // renamed to avoid clashing with the version above
    __m256i lookup = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2,
                                      2, 3, 2, 3, 3, 4, 0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);
    __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
    __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo);
    __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi);
    __m256i sum8 = _mm256_add_epi8(popcnt1, popcnt2);  // per-byte popcounts
    // vpmaddubsw with an all-ones multiplier sums adjacent byte pairs into
    // 16-bit lanes; vpmaddwd then sums adjacent 16-bit pairs into 32-bit lanes.
    return _mm256_madd_epi16(_mm256_maddubs_epi16(sum8, _mm256_set1_epi8(1)),
                             _mm256_set1_epi16(1));
}
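Note that vpmaddubsw treats its first operand as unsigned bytes and saturates to signed 16-bit, but that can't matter here: each byte of sum8 is a popcount of at most 8, so the pairwise sums stay tiny. The harness above works for this variant too if you swap in popcount_pshufb32_madd.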