How to unset N right-most set bits

Question

There is a relatively well-known trick for unsetting a single right-most bit:

y = x & (x - 1) // 0b001011100 & 0b001011011 = 0b001011000 :)

I'm finding myself with a tight loop to clear n right-most bits, but is there a simpler algebraic trick?

Assume relatively large n (n has to be <64 for 64bit integers, but it's often on the order of 20-30).

// x = 0b001011100 n=2
for (auto i=0; i<n; i++) x &= x - 1;
// x = 0b001010000

I've thumbed my TAOCP Vol4a few times, but can't find any inspiration.

Maybe there is some hardware support for it?

Peter Cordes · Accepted Answer

For Intel x86 CPUs with BMI2, pext and pdep are fast. AMD before Zen3 has very slow microcoded PEXT/PDEP (https://uops.info/) so be careful with this; other options might be faster on AMD, maybe even blsi in a loop, or better a binary-search on popcount (see below).
Only Intel has dedicated hardware execution units for the mask-controlled pack/unpack that pext/pdep do, making it constant-time: 1 uop, 3 cycle latency, can only run on port 1.

I'm not aware of other ISAs having a similar bit-packing hardware operation.

pdep basics: pdep(-1ULL, a) == a. Taking the low popcnt(a) bits from the first operand, and depositing them at the places where a has set bits, will give you a back again.

But if, instead of all-ones, your source of bits has the low N bits cleared, the first N set bits in a will grab a 0 instead of 1. This is exactly what you want.

uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
    return _pdep_u64(-1ULL << n, a);
}

-1ULL << n works for n=0..63 in C. x86 asm scalar shift instructions mask their count (effectively &63), so that's probably what will happen for the C undefined-behaviour of a larger n. If you care, use n&63 in the source so the behaviour is well-defined in C, and it can still compile to a shift instruction that uses the count directly.

On Godbolt with a simple looping reference implementation, showing that they produce the same result for a sample input a and n.

GCC and clang both compile it the obvious way, as written:

# GCC10.2 -O3 -march=skylake
unset_first_n_bits_bmi2(unsigned long, int):
        mov     rax, -1
        shlx    rax, rax, rsi
        pdep    rax, rax, rdi
        ret

(SHLX is single-uop, 1 cycle latency, unlike legacy variable-count shifts that update FLAGS... except if CL=0)

So this has 3 cycle latency from a->output (just pdep)
and 4 cycle latency from n->output (shlx, pdep).

And is only 3 uops for the front-end.

A semi-related BMI2 trick:

pext(a,a) will pack the bits at the bottom, like (1ULL<<popcnt(a)) - 1 but without overflow if all bits are set.

Clearing the low N bits of that with an AND mask, and expanding with pdep would work. But that's an overcomplicated expensive way to create a source of bits with enough ones above N zeros, which is all that actually matters for pdep. Thanks to @harold for spotting this in the first version of this answer.

Without fast PDEP: perhaps binary search for the right popcount

@Nate's suggestion of a binary search for how many low bits to clear is probably a good alternative to pdep.

Stop when popcount(x>>c) == popcount(x) - N to find out how many low bits to clear, preferably with branchless updating of c. (e.g. c = foo ? a : b often compiles to cmov).

Once you're done searching, x & (-1ULL<<c) uses that count, or just tmp << c to shift back the x>>c result you already have. Using right-shift directly is cheaper than generating a new mask and using it every iteration.

High-performance popcount is relatively widely available on modern CPUs. (Although not baseline for x86-64; you still need to compile with -mpopcnt or -march=native).

Tuning this could involve choosing a likely starting-point, and perhaps using a max initial step size instead of pure binary search. Getting some instruction-level parallelism out of trying some initial guesses could perhaps help shorten the latency bottleneck.

How to unset N right-most set bits

Tags:

bit-manipulation

integer-arithmetic

intrinsics

qdot

1 Answers

Without fast PDEP: perhaps binary search for the right popcount

Peter Cordes

Recent Activity

Donate For Us

How to unset N right-most set bits

Tags:

bit-manipulation

integer-arithmetic

intrinsics

qdot

1 Answers

Without fast PDEP: perhaps binary search for the right popcount

Peter Cordes

Related questions

Recent Activity

Donate For Us