I need to popcount, in the most efficient (fastest) way, an unsigned variable of 128 bits in size. If the solution is also more portable, even better.
First of all, there are two names for the type in GCC: __uint128_t and unsigned __int128. I guess they end up being the same, and I see no reason to write the ugly unsigned __int128, so although the latter is supposed to be the official spelling, I prefer the first one, which looks more like the standard uint64_t. Also, Intel's compiler has __uint128_t, which is another reason to use it (portability).
I have written the following code:
#include <nmmintrin.h>
#include <stdint.h>
static inline uint_fast8_t popcnt_u128 (__uint128_t n)
{
    const uint64_t n_hi = n >> 64;  /* high 64 bits */
    const uint64_t n_lo = n;        /* low 64 bits (truncation) */
    const uint_fast8_t cnt_hi = _mm_popcnt_u64(n_hi);
    const uint_fast8_t cnt_lo = _mm_popcnt_u64(n_lo);
    const uint_fast8_t cnt = cnt_hi + cnt_lo;
    return cnt;
}
Is this the absolute fastest option?
Edit:
Another option came to mind, which may (or may not) be faster:
#include <nmmintrin.h>
#include <stdint.h>
union Uint128 {
    __uint128_t uu128;
    uint64_t    uu64[2];
};

static inline uint_fast8_t popcnt_u128 (__uint128_t n)
{
    const union Uint128 n_u = {.uu128 = n};
    const uint_fast8_t cnt_a = _mm_popcnt_u64(n_u.uu64[0]);
    const uint_fast8_t cnt_b = _mm_popcnt_u64(n_u.uu64[1]);
    const uint_fast8_t cnt = cnt_a + cnt_b;
    return cnt;
}
This way, although I don't know if it is legal (is it? Edit: it is, see Type punning between integer and array using `union`?), I would avoid the shift.
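(For completeness, here is a third sketch, not from the original post, that sidesteps the type-punning question entirely: memcpy is well-defined in both C and C++, and compilers typically compile a fixed-size memcpy like this down to plain register moves. The function name is just illustrative.)

#include <nmmintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: extract both halves with memcpy, which is
 * well-defined in C and C++. The element order matches the in-memory
 * layout, so which index holds the "low" half is endian-dependent,
 * but the total popcount is the same either way. */
static inline uint_fast8_t popcnt_u128_memcpy (__uint128_t n)
{
    uint64_t halves[2];
    memcpy(halves, &n, sizeof halves);
    return _mm_popcnt_u64(halves[0]) + _mm_popcnt_u64(halves[1]);
}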
With GCC and clang, both your functions compile to identical asm if you remove the static inline, and presumably they will inline equivalently.
I'd suggest using unsigned, because sizeof(uint_fast8_t) = 1 on x86-64 Linux. The _fast types beg the question "fast for what purpose?"; fast8 is good for compact storage in arrays, while fast32 is a 64-bit type which maybe avoids redoing sign or zero extension for pointer math but wastes space in arrays.
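Applied to the question's function, that suggestion looks something like this (a sketch, not verbatim from the answer):

#include <nmmintrin.h>
#include <stdint.h>

/* Sketch of the suggestion above: return plain unsigned instead of
 * uint_fast8_t, so the compiler isn't asked to truncate to 8 bits.
 * The result is at most 128, so any unsigned type can hold it. */
static inline unsigned popcnt_u128 (__uint128_t n)
{
    return _mm_popcnt_u64((uint64_t)(n >> 64))
         + _mm_popcnt_u64((uint64_t)n);
}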
clang knows that the sum of two popcnt results fits in an 8-bit integer without overflow, so it can optimize away zero-extension even if you sum the result into an unsigned counter, but gcc doesn't. (e.g. change the return type to unsigned and you'll get an extra movzx eax, dil instruction.) The hardware popcnt instruction produces a result that's correctly zero-extended to 64-bit, but assigning to uint_fast8_t aka uint8_t is explicitly asking the compiler to truncate results to 8-bit.
The x86-64 System V ABI allows high garbage in args and return values, so when the return type is narrow, a stand-alone version of the function can allow carry into the high bits of EAX.
I would avoid the shift.
The shift only exists in the C source. In the asm, the high/low halves will be stored in separate 64-bit registers, or separate memory source operands.
From the Godbolt compiler explorer
# gcc8.3 -O3 -march=haswell for the union and the shift version
popcnt_u128:
xor eax, eax # break popcnt's false dependency on Intel CPUs
popcnt rsi, rsi # _mm_popcnt_u64(n_hi);
popcnt rax, rdi # popcnt(lo)
add eax, esi # clang uses add al,cl and doesn't avoid false deps except in a loop
ret # return value in AL (low 8 bits of EAX)
GCC could have avoided the xor-zeroing by doing both popcnts in place, and using lea eax, [rdi + rsi]. But you said something about an array, so if the data is coming from memory then GCC will normally mov-load and then popcnt in place to avoid the false dependency. (Why does breaking the "output dependency" of LZCNT matter?) Or actually, it will xor-zero the destination and then use a memory-source popcnt, which might be slightly smaller code-size.
I don't trust __builtin_popcountll because it uses long long instead of uint64_t. I think it is insane to create a function that deals with bits and uses a type that isn't of fixed width. I don't know what GCC people were thinking about.
It actually uses unsigned long long, not signed long long; that would be insane. unsigned long long is at least 64 bits, and uint64_t is required to be exactly 64 bits. (In fact, uint64_t only exists on C implementations that have a type that's exactly 64 bits with no padding; support for it is optional.) I'm not sure if GNU C supports any targets where unsigned long long isn't 64 bits, or where uint64_t isn't available. Or even int64_t, which is also required to be 2's complement. (IDK if GCC supports any non-2's-complement targets.)
You can cast the inputs to uint64_t to make sure there are no higher bits set. Implicit conversion from uint64_t to unsigned long long won't set any extra bits, even on a platform where unsigned long long is wider than 64 bits.

e.g. __builtin_popcountll( (uint64_t)n ); will always safely count the low 64 bits of n, regardless of the width of unsigned long long.
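Putting that together, a minimal sketch of a fully portable 128-bit popcount built on that cast (the function name is an illustrative assumption):

#include <stdint.h>

/* Sketch using the cast described above: each half is truncated to
 * uint64_t first, so only the intended 64 bits are counted even if
 * unsigned long long is wider than 64 bits. */
static inline unsigned popcnt_u128_builtin (__uint128_t n)
{
    return __builtin_popcountll((uint64_t)(n >> 64))
         + __builtin_popcountll((uint64_t)n);
}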
I'm using a very big static array. Do I have to care about cache, or does GCC handle that for me? I thought that was only a problem with malloc and that stuff. GCC knows the array at compile time, so it can do that better than me.
GCC will (almost?) never re-arrange your loops to change memory access patterns. Static arrays are not substantially different from malloc'ed memory; they don't stay hot in cache for free. See What Every Programmer Should Know About Memory? to learn more.
But if you're just looping sequentially through memory and popcounting a whole array, then it doesn't really matter whether you do it with __uint128_t or not.
clang will auto-vectorize __builtin_popcountll or _mm_popcnt_u64 over an array with AVX2 vpshufb (as a nibble LUT), which is good on Intel CPUs including your Broadwell. See Counting 1 bits (population count) on large data using AVX-512 or AVX-2.

But unfortunately using your wrapper function for an array of __uint128_t defeats that. See the last 2 functions in the Godbolt link.
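For illustration, a minimal sketch of the kind of loop clang can auto-vectorize this way (the name and signature are assumptions, not from the original answer):

#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: popcount a whole array as plain uint64_t elements rather
 * than through a __uint128_t wrapper; clang can auto-vectorize this
 * with AVX2 vpshufb when built with e.g. -O3 -march=haswell. */
uint64_t popcnt_array (const uint64_t *arr, size_t len)
{
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += _mm_popcnt_u64(arr[i]);
    return total;
}

Feeding the data to the compiler as plain uint64_t elements like this keeps the auto-vectorization opportunity alive.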