
What is a fast fallback algorithm which emulates PDEP and PEXT in software?

I want to create a wrapper around the x86 instructions PDEP (Parallel Bit Deposit) and PEXT (Parallel Bit Extract). On architectures where these aren't available (and the corresponding intrinsics aren't either), I need a fast fallback implementation.

The naive algorithms for 32-bit integers look as follows:

#include <cstdint>

constexpr std::uint32_t bit_deposit(std::uint32_t src, std::uint32_t mask) {
    std::uint32_t result = 0;
    for (std::uint32_t src_pos = 0, mask_pos = 0; mask_pos != 32; ++mask_pos) {
        if (mask >> mask_pos & 1) {
            result |= (src >> src_pos++ & 1) << mask_pos;
        }
    }
    return result;
}

static_assert(bit_deposit(0b000, 0b000000) == 0b000000);
static_assert(bit_deposit(0b101, 0b101010) == 0b100010);
static_assert(bit_deposit(0b111, 0b101010) == 0b101010);

constexpr std::uint32_t bit_extract(std::uint32_t src, std::uint32_t mask) {
    std::uint32_t result = 0;
    for (std::uint32_t src_pos = 0, mask_pos = 0; mask_pos != 32; ++mask_pos) {
        if (mask >> mask_pos & 1) {
            result |= (src >> mask_pos & 1) << src_pos++;
        }
    }
    return result;
}

static_assert(bit_extract(0b000000, 0b000000) == 0b000);
static_assert(bit_extract(0b100010, 0b101010) == 0b101);
static_assert(bit_extract(0b101010, 0b101010) == 0b111);

Is there a way to significantly improve this? The current algorithm is O(n), and I suspect there is an O(log n) solution.

For a lot of other bit-manipulation ops there is a logarithmic algorithm (multi-bit shift built from single-bit shifts, population count, round up to the next power of two, etc.). Therefore, I suspect that something similar exists for bit_deposit and bit_extract.
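For example, population count can be done in a logarithmic number of steps like this (the well-known bithack, shown only to illustrate the kind of structure I'm hoping exists for deposit/extract):

constexpr std::uint32_t popcount32(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555);                 // 2-bit sums
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);  // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F;                 // 8-bit sums
    return (x * 0x01010101) >> 24;                   // add all bytes into the top byte
}

static_assert(popcount32(0b101010) == 3);
static_assert(popcount32(0xFFFFFFFF) == 32);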

asked Oct 23 '25 by Jan Schultke


2 Answers

In Hacker's Delight, chapter 7 (Rearranging Bits and Bytes), there is this implementation for expand:

unsigned expand(unsigned x, unsigned m) {
    unsigned m0, mk, mp, mv, t;
    unsigned array[5];
    int i;
    m0 = m; // Save original mask.
    mk = ~m << 1; // We will count 0's to right.
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1); // Parallel suffix.
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m; // Bits to move.
        array[i] = mv;
        m = (m ^ mv) | (mv >> (1 << i)); // Compress m.
        mk = mk & ~mp;
    }
    for (i = 4; i >= 0; i--) {
        mv = array[i];
        t = x << (1 << i);
        x = (x & ~mv) | (t & mv);
    }
    return x & m0; // Clear out extraneous bits.
}

And this implementation for compress:

unsigned compress(unsigned x, unsigned m) {
    unsigned mk, mp, mv, t;
    int i;
    x = x & m; // Clear irrelevant bits.
    mk = ~m << 1; // We will count 0's to right.
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1); // Parallel suffix.
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m; // Bits to move.
        m = m ^ mv | (mv >> (1 << i)); // Compress m.
        t = x & mv;
        x = x ^ t | (t >> (1 << i)); // Compress x.
        mk = mk & ~mp;
    }
    return x;
}

These don't take k iterations for a k-bit integer, but I don't know whether they are actually a good way to do it. I've heard a rumour that you can do better, but I haven't heard how; maybe it's not even true.
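As a quick sanity check (not part of the Hacker's Delight code), one could brute-force compare these against the naive loops from the question, assuming bit_deposit/bit_extract and the expand/compress routines above are all in scope:

#include <cassert>
#include <cstdint>
#include <random>

int main() {
    std::mt19937 gen(42);
    for (int i = 0; i < 1000000; ++i) {
        std::uint32_t x = gen(), m = gen();
        assert(expand(x, m) == bit_deposit(x, m));   // expand behaves like PDEP
        assert(compress(x, m) == bit_extract(x, m)); // compress behaves like PEXT
    }
}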

answered Oct 26 '25 by harold


Compress/expand is, I think, a primitive operation that can't easily or efficiently be composed from other building blocks in the general case of an unknown mask. Ben Voigt's answer has some good insight into why we should expect this to remain an unsolved problem.

It's only a good idea to use a general-case implementation if you know you'll get true hardware support, or if that implementation is able to optimize to something much simpler for some compile-time-constant masks. (Or as a slow compat fallback for portability and testing in code that's intended to run only on platforms known to have fast support.)

But anyway, I think a useful criterion or goal for a proposed library implementation is how non-terrible it is with regular patterns like 0x0f0f0f0f, expanding or packing nibbles to bytes. Optimizations like compiling a mask=0x80808080 pext into movd xmm0, edi / pmovmskb eax, xmm0 would be up to the compiler; there's probably not a lot we can do to help with that in a scalar implementation.
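For that 0x0f0f0f0f case specifically, a hand-rolled constant-mask specialization is just a few shift/mask steps. This sketch (mine, not from any particular library) is roughly what we'd hope a good implementation or the compiler could reduce to:

#include <cstdint>

// pdep(x, 0x0F0F0F0F): spread the low 16 bits of x into the low nibble of each byte
constexpr std::uint32_t expand_nibbles(std::uint32_t x) {
    x &= 0x0000FFFF;                   // only the low 16 source bits are used
    x = (x | (x << 8)) & 0x00FF00FF;   // 0x0000ABCD -> 0x00AB00CD
    x = (x | (x << 4)) & 0x0F0F0F0F;   // 0x00AB00CD -> 0x0A0B0C0D
    return x;
}

// pext(x, 0x0F0F0F0F): gather the low nibble of each byte into the low 16 bits
constexpr std::uint32_t pack_nibbles(std::uint32_t x) {
    x &= 0x0F0F0F0F;
    x = (x | (x >> 4)) & 0x00FF00FF;   // 0x0A0B0C0D -> 0x00AB00CD
    x = (x | (x >> 8)) & 0x0000FFFF;   // 0x00AB00CD -> 0x0000ABCD
    return x;
}

static_assert(expand_nibbles(0xABCD) == 0x0A0B0C0D);
static_assert(pack_nibbles(0x0A0B0C0D) == 0xABCD);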


Fast HW support:

  • Intel with BMI2 (P-cores since Haswell, E-cores since Gracemont: 1 uop, 3c latency, 1c throughput)
  • AMD since Zen 3, same as Intel. (Zen 2, Zen 1, and Excavator have disastrously slow microcoded pext/pdep with a data-dependent number of uops, with 52 cycle throughput cost being possible, and that might not even be the worst case, IDK. So unfortunately the BMI2 feature bit doesn't mean it's practically worth using.)
  • AArch64 with SVE2 BITPERM can do this on each 8, 16, 32 or 64-bit element of a vector. For scalar integer use-cases, you'd need to get data into and back out of a vector reg. (Cortex-A710's opt guide says bext/bdep/bgrp are 6c latency, one per 2 cycle throughput.) An intro to SVE2 guide mentions HPC genomics as a use-case, perhaps for packing/unpacking 2-bit per base DNA sequences, or 6-bit codons (3 DNA bases).
  • The RISC-V V extension has vcompress.vm, which is the equivalent operation on vector elements, like AVX-512 vpcompressb. So you could expand the bits of one input to 0/-1 vector elements and use the other as a compress mask, then compare/test back into an integer bitmask result.
    @harold commented there's a note in the spec for how to build vdecompress (which it doesn't directly have) in terms of a couple of other instructions.

So there are already at least three mainstream ISAs with hardware support for this primitive operation.
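As a concrete starting point for the wrapper the question asks about, a minimal sketch (using the naive bit_deposit / bit_extract from the question as the fallback) could dispatch on the compiler's BMI2 target macro. Note that __BMI2__ alone can't distinguish fast implementations from the slow microcoded Zen 2 one mentioned above:

#include <cstdint>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

inline std::uint32_t deposit_bits(std::uint32_t src, std::uint32_t mask) {
#if defined(__BMI2__)
    return _pdep_u32(src, mask);    // single pdep instruction
#else
    return bit_deposit(src, mask);  // portable fallback from the question
#endif
}

inline std::uint32_t extract_bits(std::uint32_t src, std::uint32_t mask) {
#if defined(__BMI2__)
    return _pext_u32(src, mask);    // single pext instruction
#else
    return bit_extract(src, mask);  // portable fallback from the question
#endif
}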

I'm working on a proposal to add std::deposit_bits to C++26

It's certainly a good idea to lay the groundwork for portable code to take advantage of this primitive operation by getting it into C++26, even if it's not widely available outside x86 yet. Unlike std::countr_zero / std::countl_zero, where C++20 was 35 years late (386 bsf/bsr, ARM clz or rbit+clz, widespread across many ISAs), or std::popcount, which at least was portably exposed by good library implementations of std::bitset<>::count.

Even if C++26 comes out in 2026, not all projects will be ready to switch right away, so after a couple more years a larger fraction of the x86 hardware still in use will support -march=x86-64-v3 with fast (not slow microcoded) pdep/pext.

ARM has had rbit (bit-reverse) for a very long time, but it's not common on other ISAs. Still, Rust has a builtin for it, i32::reverse_bits(), taking the sensible approach of having language built-ins for many simple things that many ISAs have, instead of needing target-specific intrinsics or recognizing a loop idiom to get efficient asm. (Rust doesn't have pdep / pext builtins, though.)

C++'s <bit> header has a byte-reverse but not a bit-reverse function. There are platform-specific ways to do it fast, such as x86 with SSSE3 pshufb to unpack 8 bytes to 16 nibbles and use a lookup table, but it's not plausible to take advantage of that from portable code. (And a log(n) bithack with a handful of masks works as a generic fallback implementation; see the sketch below.) So if we're adding stuff to <bit> in C++26, that's another good one to consider.
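That generic fallback is the classic swap-halves-of-decreasing-size bithack, something like this sketch:

#include <cstdint>

constexpr std::uint32_t bit_reverse(std::uint32_t x) {
    x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555);  // swap adjacent bits
    x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333);  // swap bit pairs
    x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F);  // swap nibbles
    x = ((x & 0x00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF);  // swap bytes
    return (x << 16) | (x >> 16);                           // swap 16-bit halves
}

static_assert(bit_reverse(0x00000001) == 0x80000000);
static_assert(bit_reverse(0x12345678) == 0x1E6A2C48);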


As further practical evidence of pdep/pext being hard to synthesize from "standard" ops like boolean and add/sub carry-propagation, AMD Zen 2 and earlier had very slow microcoded pdep/pext, with a data-dependent number of uops. https://uops.info/ measured 119 / 133 uops respectively for the 64-bit memory-source version with one per ~50 cycle throughput. They measured the reg source version at only 6 or 7 uops with 18c throughput, but probably that was just different inputs (initializer code not shown), and 133 uops might not even be the worst case, just far from the best.

Many real use-cases involve some known patterns, e.g. perhaps 4-bit groups go together, or the mask is even a constant. In those cases, there's almost always something faster than Zen 2 microcoded pdep/pext which software should use instead.
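For example, if the mask is known to be 0x55555555 (de-interleaving every other bit, as in Morton-code decoding), the standard shift/mask sequence takes only log2(width) steps; a sketch:

#include <cstdint>

// pext(x, 0x55555555): gather the even-position bits into the low 16 bits
constexpr std::uint32_t extract_even_bits(std::uint32_t x) {
    x &= 0x55555555;
    x = (x ^ (x >> 1)) & 0x33333333;
    x = (x ^ (x >> 2)) & 0x0F0F0F0F;
    x = (x ^ (x >> 4)) & 0x00FF00FF;
    x = (x ^ (x >> 8)) & 0x0000FFFF;
    return x;
}

static_assert(extract_even_bits(0x55555555) == 0x0000FFFF);
static_assert(extract_even_bits(0b0100) == 0b10);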

In a few years, Zen 2 will be sufficiently obsolete that some software compiled for -march=x86-64-v3 can reasonably assume pdep/pext to be usable, but any dynamic CPU dispatching should check the CPU model and not use a pdep/pext version on Zen 2 and earlier, even though the BMI2 feature bit is set.

The fact that this is hard is why pdep/pext exist, along with AVX-512 vpexpand / vpcompress - the same thing for vector elements according to a k mask. For left-packing with x86-64-v3 (AVX2 what is the most efficient way to pack left based on a mask?), the best I or anyone else has come up with is a way to use pext to pack integer indices and generate a shuffle-control from it.


Other Q&As have various implementations of pext and/or pdep, none with a log2(width) number of steps. They usually involve something like mask &= (mask - 1); to clear the lowest set bit and loop over the 1s in one of the inputs (a sketch of that idiom follows after the list).

  • Standard C++11 code equivalent to the PEXT Haswell instruction (and likely to be optimized by compiler) / https://codereview.stackexchange.com/questions/278954/c-pext-pdep-implementations - still a no on idiom-recognition or efficiency.

  • Bit twiddle help: Expanding bits to follow a given bitmask (njuffa's attempt; he comments that https://www.chessprogramming.org/BMI2#Serial_Implementation looks probably faster)

  • Emulate AVX512 VPCOMPRESSB byte packing without AVX512_VBMI2 (the same problem but for vector elements.) - unanswered

If the answers here are actually more efficient, some of those should get closed as a duplicate of this.
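For reference, a sketch of that clear-lowest-set-bit idiom applied to pext, looping once per set mask bit rather than once per bit position:

#include <cstdint>

constexpr std::uint32_t pext_loop(std::uint32_t src, std::uint32_t mask) {
    std::uint32_t result = 0;
    for (std::uint32_t bit = 1; mask != 0; bit <<= 1) {
        std::uint32_t lowest = mask & -mask;  // isolate the lowest set bit of the mask
        if (src & lowest) {
            result |= bit;
        }
        mask &= mask - 1;                     // clear that bit, move on to the next one
    }
    return result;
}

static_assert(pext_loop(0b100010, 0b101010) == 0b101);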

answered Oct 26 '25 by Peter Cordes


