I need to popcount, in the most efficient (fastest) way, an unsigned variable of 128 bits in size. If the solution is also more portable, even better.
First of all, there are two names for the type in GCC: __uint128_t and unsigned __int128. I guess they end up being the same, and I see no reason to write the ugly unsigned __int128, so although the latter is supposed to be the official spelling, I prefer the first one, which looks more like the standard uint64_t. Also, Intel's compiler has __uint128_t, which is another reason to use it (portability).
I have written the following code:
#include <nmmintrin.h>
#include <stdint.h>
static inline uint_fast8_t popcnt_u128 (__uint128_t n)
{
    const uint64_t n_hi = n >> 64;  /* high 64 bits */
    const uint64_t n_lo = n;        /* low 64 bits (truncation) */
    const uint_fast8_t cnt_hi = _mm_popcnt_u64(n_hi);
    const uint_fast8_t cnt_lo = _mm_popcnt_u64(n_lo);
    const uint_fast8_t cnt = cnt_hi + cnt_lo;
    return cnt;
}
Is this the absolute fastest option?
Edit:
Another option came to mind, which may (or may not) be faster:
#include <nmmintrin.h>
#include <stdint.h>
union Uint128 {
    __uint128_t uu128;
    uint64_t    uu64[2];
};

static inline uint_fast8_t popcnt_u128 (__uint128_t n)
{
    const union Uint128 n_u = {.uu128 = n};
    const uint_fast8_t cnt_a = _mm_popcnt_u64(n_u.uu64[0]);
    const uint_fast8_t cnt_b = _mm_popcnt_u64(n_u.uu64[1]);
    const uint_fast8_t cnt = cnt_a + cnt_b;
    return cnt;
}
This way, although I don't know if it is legal (is it? Edit: it is, see Type punning between integer and array using `union`?), I would avoid the shift.
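(For completeness, here is a third sketch, not from the original post, that sidesteps the type-punning question entirely: memcpy is well-defined in both C and C++, and compilers typically compile a fixed-size memcpy like this down to plain register moves. The function name is just illustrative.)

#include <nmmintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: extract both halves with memcpy, which is
 * well-defined in C and C++. The element order matches the in-memory
 * layout, so which index holds the "low" half is endian-dependent,
 * but the total popcount is the same either way. */
static inline uint_fast8_t popcnt_u128_memcpy (__uint128_t n)
{
    uint64_t halves[2];
    memcpy(halves, &n, sizeof halves);
    return _mm_popcnt_u64(halves[0]) + _mm_popcnt_u64(halves[1]);
}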
With GCC and clang, both your functions compile to identical asm if you remove the static inline, and presumably they will inline equivalently.
I'd suggest using unsigned, because sizeof(uint_fast8_t) = 1 on x86-64 Linux. The _fast types beg the question "fast for what purpose?"; fast8 is good for compact storage in arrays, while fast32 is a 64-bit type which maybe avoids redoing sign or zero extension for pointer math but wastes space in arrays.
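Applied to the question's function, that suggestion looks something like this (a sketch, not verbatim from the answer):

#include <nmmintrin.h>
#include <stdint.h>

/* Sketch of the suggestion above: return plain unsigned instead of
 * uint_fast8_t, so the compiler isn't asked to truncate to 8 bits.
 * The result is at most 128, so any unsigned type can hold it. */
static inline unsigned popcnt_u128 (__uint128_t n)
{
    return _mm_popcnt_u64((uint64_t)(n >> 64))
         + _mm_popcnt_u64((uint64_t)n);
}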
clang knows that the sum of two popcnt results fits in an 8-bit integer without overflow, so it can optimize away zero-extension even if you sum the result into an unsigned counter, but gcc doesn't. (e.g. change the return type to unsigned and you'll get an extra movzx eax, dil instruction.) The hardware popcnt instruction produces a result that's correctly zero-extended to 64-bit, but assigning to uint_fast8_t aka uint8_t is explicitly asking the compiler to truncate results to 8-bit.
The x86-64 System V ABI allows high garbage in args and return values, so when the return type is narrow, a stand-alone version of the function can allow carry into the high bits of EAX.
I would avoid the shift.
The shift only exists in the C source. In the asm, the high/low halves will be stored in separate 64-bit registers, or separate memory source operands.
From the Godbolt compiler explorer
# gcc8.3 -O3 -march=haswell for the union and the shift version
popcnt_u128:
xor eax, eax # break popcnt's false dependency on Intel CPUs
popcnt rsi, rsi # _mm_popcnt_u64(n_hi);
popcnt rax, rdi # popcnt(lo)
add eax, esi # clang uses add al,cl and doesn't avoid false deps except in a loop
ret # return value in AL (low 8 bits of EAX)
GCC could have avoided the xor-zeroing by doing both popcnts in place, and using lea eax, [rdi + rsi]. But you said something about an array, so if the data is coming from memory then GCC will normally mov-load and then popcnt in place to avoid the false dependency. (Why does breaking the "output dependency" of LZCNT matter?) Or actually, it will xor-zero the destination and then use a memory-source popcnt, which might be slightly smaller code-size.
I don't trust __builtin_popcountll because it uses long long instead of uint64_t. I think it is insane to create a function that deals with bits and uses a type that isn't of fixed width. I don't know what GCC people were thinking about.
It actually uses unsigned long long, not signed long long; that would be insane. unsigned long long is at least 64 bits, and uint64_t is required to be exactly 64 bits. (In fact, uint64_t only exists on C implementations that have a type that's exactly 64 bits with no padding; support for it is optional.) I'm not sure if GNU C supports any targets where unsigned long long isn't 64 bits, or where uint64_t isn't available. Or even int64_t, which is also required to be 2's complement. (IDK if GCC supports any non-2's-complement targets.)
You can cast the inputs to uint64_t to make sure there are no higher bits set. Implicit conversion from uint64_t to unsigned long long won't set any extra bits, even on a platform where unsigned long long is wider than 64 bits.

e.g. __builtin_popcountll( (uint64_t)n ); will always safely count the low 64 bits of n, regardless of the width of unsigned long long.
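Putting that together, a minimal sketch of a fully portable 128-bit popcount built on that cast (the function name is an illustrative assumption):

#include <stdint.h>

/* Sketch using the cast described above: each half is truncated to
 * uint64_t first, so only the intended 64 bits are counted even if
 * unsigned long long is wider than 64 bits. */
static inline unsigned popcnt_u128_builtin (__uint128_t n)
{
    return __builtin_popcountll((uint64_t)(n >> 64))
         + __builtin_popcountll((uint64_t)n);
}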
I'm using a very big static array. Do I have to care about cache, or does GCC handle that for me? I thought that was only a problem with malloc and that stuff. GCC knows the array at compile time, so it can do that better than me.
GCC will (almost?) never re-arrange your loops to change memory access patterns. Static arrays are not substantially different from malloc'ed memory; they don't stay hot in cache for free. See What Every Programmer Should Know About Memory? to learn more.
But if you're just looping sequentially through memory and popcounting a whole array, then it doesn't really matter whether you do it with __uint128_t or not.
clang will auto-vectorize __builtin_popcountll or _mm_popcnt_u64 over an array with AVX2 vpshufb (as a nibble LUT), which is good on Intel CPUs including your Broadwell. See Counting 1 bits (population count) on large data using AVX-512 or AVX-2.

But unfortunately using your wrapper function for an array of __uint128_t defeats that. See the last 2 functions in the Godbolt link.
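For illustration, a minimal sketch of the kind of loop clang can auto-vectorize this way (the name and signature are assumptions, not from the original answer):

#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: popcount a whole array as plain uint64_t elements rather
 * than through a __uint128_t wrapper; clang can auto-vectorize this
 * with AVX2 vpshufb when built with e.g. -O3 -march=haswell. */
uint64_t popcnt_array (const uint64_t *arr, size_t len)
{
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += _mm_popcnt_u64(arr[i]);
    return total;
}

Feeding the data to the compiler as plain uint64_t elements like this keeps the auto-vectorization opportunity alive.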