fastest way to convert two-bit number to low-memory representation

Question

I have a 56-bit number with potentially two set bits, e.g., 00000000 00000000 00000000 00000000 00000000 00000000 00000011. In other words, two bits are distributed among 56 bits, so that we have bin(56,2)=1540 possible permutations.

I now look for a loss-free mapping of such an 56 bit number to an 11-bit number that can carry 2048 and therefore also 1540. Knowing the structure, this 11-bit number is enough to store the value of my low-density (of ones) 56 bit number.

I want to maximize performance (this function should run millions or even billions of times per second if possible). So far, I only came up with some loop:

int inputNumber = 24; // 11000
int bitMask = 1;
int bit1 = 0, bit2 = 0;    
for(int n = 0; n < 54; ++n, bitMask *= 2)
{
    if((inputNumber & bitMask) != 0)
    {
        if(bit1 != 0)
            bit1 = n;
        else
        {
            bit2 = n;
            break;
        }
    }
}

and using these two bits, I can easily generate some 1540 max number.

But is there no faster version than using such a loop?

Peter Cordes · Accepted Answer

Most ISAs have hardware support for a bit-scan instruction that finds the position of a set bit. Use that instead of a naive loop or bithack for any architecture where you care about this running fast. https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious has some tricks that are better than nothing, but those are all still much worse than a single efficient asm instruction.

But ISO C++ doesn't portably expose clz/ctz operations; it's only available via intrinsics / builtins for various implementations. (And the x86 intrinsincs have quirks for all-zero input, corresponding to the asm instruction behaviour).

For some ISAs, it's a count-leading-zeros giving you 31 - highbit_index. For others, it's a CTZ count trailing zeros operation, giving you the index of the low bit. x86 has both. (And its high-bit finder actually directly finds the high-bit index, not a leading-zero count, unless you use BMI1 lzcnt instead of traditional bsr) https://en.wikipedia.org/wiki/Find_first_set has a table of what different ISAs have.

GCC portably provides __builtin_clz and __builtin_ctz; on ISAs without hardware support, they compile to a call to a helper functions. See What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C? and Implementation of __builtin_clz

(For 64-bit integers, you want the long long versions: like __builtin_ctzll GCC manual.)

If we only have a CLZ, use high=63-CLZ(n) and low= 63-CLZ((-n) & n) to isolate the low bit. Note that x86's bsr instruction actually produces 63-CLZ(), i.e. the bit-index instead of the leading-zero count. So 63-__builtin_clzll(n) can compile to a single instruction on x86; IIRC gcc does notice this. Or 2 instructions if GCC uses an extra xor-zeroing to avoid the inconvenient false dependency.

If we only have CTZ, do low = CTZ(n) and high = CTZ(n & (n - 1)) to clear the lowest set bit. (Leaving the high bit, assuming the number has exactly 2 set bits).

If we have both, low = CTZ(n) and high = 63-CLZ(n). I'm not sure what GCC does on non-x86 ISAs where they aren't both available natively. The GCC builtins are always available even when targeting HW that doesn't have it. But the internal implementation can't use the above tricks because it doesn't know there are always exactly 2 bits set.

(I wrote out the full formulas; an earlier version of this answer had CLZ and CTZ reversed in this part. I find that happens to me easily, especially when I also have to keep track of x86's bsr and bsr (bitscan reverse and forward) and remember that those are leading and trailing, respectively.)

So if you just use both CTZ and CLZ, you might end up with slow emulation for one of them. Or fast emulation on ARM with rbit to bit-reverse for clz, which is 100% fine.

AVX512CD has SIMD VPLZCNTQ for 64-bit integers, so you could encode 2, 4, or 8x 64-bit integers in parallel with that on recent Intel CPUs. For SSSE3 or AVX2, you can build a SIMD lzcnt by using pshufb _mm_shuffle_epi8 byte-shuffle as a 4-bit LUT and combining with _mm_max_epu8. There was a recent Q&A about this but I can't find it. (It might have been for 16-bit integers only; wider requires more work.)

With this, a Skylake-X or Cascade Lake CPU could maybe compress 8x 64-bit integers per 2 or 3 clock cycles once you factor in the throughput cost of packing the results. SIMD is certainly useful for packing 12-bit or 11-bit results into a contiguous bitstream, e.g. with variable-shift instructions, if that's what you want to do with the results. At ~3 or 4GHz clock speed, that could maybe get you over 10 billion per clock with a single thread. But only if the inputs come from contiguous memory. Depending what you want to do with the results, it might cost a few more cycles to do more than just pack them down to 16-bit integers. e.g. to pack into a bitstream. But SIMD should be good for that with variable-shift instructions that can line up the 11 or 12 bits from each register into the right position to OR together after shuffling.

There's a tradeoff between coding efficiency and encode performance. Using 12 bits for two 6-bit indices (of bit positions) is very simple both to compress and decompress, at least on hardware that has bit-scan instructions.

Or instead of bit-indices, one or both could be leading zero counts, so decoding would be (1ULL << 63) >> a. 1ULL>>63 is a fixed constant that you can actually right-shift, or the compiler could turn it into a left-shift of 1ULL << (63-a) which IIRC optimizes to 1 << (-a) in assembly for ISAs like x86 where shift instructions mask the shift count (look only at the low 6 bits).

Also, 2x 12 bits is a whole number of bytes, but 11 bits only gives you a whole number of bytes every 8 outputs, if you're packing them. So indexing a bit-packed array is simpler.

0 is still a special case: maybe handle that by using all-ones bit-indices (i.e. index = bit 63, which is outside the low 56 bits). On decode/decompress, you set the 2 bit positions (1ULL<<a) | (1ULL<<b) and then & mask to clear high bits. Or bias your bit indices and have decode right shift by 1.

If we didn't have to handle zero then a modern x86 CPU could do 1 or 2 billion encodes per second if it didn't have to do anything else. e.g. Skylake has 1 per clock throughput for bit-scan instructions and should be able to encode at 1 number per 2 clocks just bottlenecked on that. (Or maybe better with SIMD). With just 4 scalar instructions, we can get the low and high indices (64-bit tzcnt + bsr), shift by 6 bits, and OR together.¹ Or on AMD, avoid bsr / bsf and manually do 63-lzcnt.

A branchy or branchless check for input == 0 to to set the final result to whatever hard-coded constant (like 63 , 63) should be cheap, though.

Compression on other ISAs like AArch64 is also cheap. It has clz but not ctz. Probably your best bet there is use an intrinsic for rbit to bit-reverse a number (so clz on the bit-reversed number directly gives you the bit-index of the low bit. Which is now the high bit of the reversed version.) Assuming rbit is as fast as add / sub, this is cheaper than using multiple instructions to clear the low bit.

If you really want 11 bits then you need to avoid the redundancy of 2x 6-bit being able to have either index larger than the other. Like maybe have 6-bit a and 5-bit b, and have a<=b mean something special like b+=32. I haven't thought this through fully. You need to be able to encode 2 adjacent bits either near the top or bottom of the registers, or the 2 set bits could be as far apart as 28 bits, if we consider wrapping at the boundaries like a 56-bit rotate.

Melpomene's suggestion to isolate the low and high set bits might be useful as part of something else, but is only useful for encoding on targets where you only have one direction of bit-scan available, not both. Even so, you wouldn't actually use both expressions. Leading-zero count doesn't require you to isolate the low bit, you just need to clear it to get at the high bit.

Footnote 1: decoding on x86 is also cheap: x |= (1<<a) is 1 instruction: bts. But many compilers have missed optimizations and don't notice this, instead actually shifting a 1. bts reg, reg is 1 uop / 1 cycle latency on Intel since PPro, or sometimes 2 uops on AMD. (Only the memory destination version is slow.) https://agner.org/optimize/

Best encoding performance on AMD CPUs requires BMI1 tzcnt / lzcnt because bsr and bsf are slower (6 uops instead of 1 https://agner.org/optimize/). On Ryzen, lzcnt is 1 uop, 1c latency, 4 per clock throughput. But tzcnt is 2 uops.

With BMI1, the compiler could use blsr to clear the lowest set bit of a register (and copy it). i.e. modern x86 has an instruction for dst = (SRC-1) bitwiseAND ( SRC ); that are single-uop on Intel but 2 uops on AMD.

But with lzcnt being more efficient than tzcnt on AMD Ryzen, probably the best asm for AMD doesn't use it.

Or maybe something like this (assuming exactly 2 bits, which apparently we can do).

(This asm is what you'd like to get your compiler to emit. Don't actually use inline asm!)

Ryzen_encode_scalar:    ; input in RDI, output in EAX
   lzcnt    rcx, rdi       ; 63-high bit index
   tzcnt    rdx, rdi       ; low bit

   mov      eax, 63
   sub      eax, ecx

   shl      edx, 6
   or       eax, edx       ; (low_bit << 6) | high_bit

   ret                     ; goes away with inlining.

Shifting the low bit-index balances the lengths of the critical path, allowing better instruction-level parallelism, if we need 63-CLZ for the high bit.

Throughput: 7 uops total, and no execution-unit bottlenecks. So at 5 uops per clock pipeline width, that's better than 1 per 2 clocks.

Skylake_encode_scalar:    ; input in RDI, output in EAX
   tzcnt    rax, rdi       ; low bit.  No false dependency on Skylake.  GCC will probably xor-zero RAX because there is on Broadwell and earlier.
   bsr      rdi, rdi       ; high bit index.  same,same reg avoids false dep
   shl      eax, 6
   or       eax, edx

   ret                     ; goes away with inlining.

This has 5 cycle latency from input to output: bitscan instructions are 3 cycles on Intel vs. 1 on AMD. SHL + OR each add 1 cycle.

For throughput, we only bottleneck on one bit-scan per cycle (execution port 1), so we can do one encode per 2 cycles with 4 uops of front-end bandwidth left over for load, store, and loop overhead (or something else), assuming we have multiple independent encodes to do.

(But for the multiple independent encode case, SIMD may still be better for both AMD and Intel, if a cheap emulation of vplzcntq exists and the data is coming from memory.)

Scalar decode can be something like this:

decode:    ;; input in EDI, output in RAX
    xor   eax, eax           ; RAX=0
    bts   rax, rdi           ; RAX |= 1ULL << (high_bit_idx & 63)

    shr   edi, 6             ; extract low_bit_idx
    bts   rax, rdi           ; RAX |= 1ULL << low_bit_idx

    ret

This has 3 shifts (including the bts) which on Skylake can only run on port0 or port6. So on Intel it only costs 4 uops for the front-end (so 1 per clock as part of doing something else). But if doing only this, it bottlenecks on shift throughput at 1 decode per 1.5 clock cycles.

On a 4GHz CPU, that's 2.666 billion decodes per second, so yeah we're doing pretty well hitting your targets :)

Or Ryzen, bts reg,reg is 2 uops , with 0.5c throughput, but shr can run on any port. So it doesn't steal throughput from bts, and the whole thing is 6 uops (vs. Ryzen's pipeline being 5-wide at the narrowest point). So 1 encode per 1.2 clock cycles, just bottlenecked on front-end cost.

With BMI2 available, starting with a 1 in a register and using shlx rax, rbx, rdi can replace the xor-zeroing + first BTS with a single uop, assuming the 1 in a register can be reused in a loop.

(This optimization is totally dependent on your compiler to find; flag-less shifts are just more efficient ways to copy-and-shift that become available with -march=haswell or -march=znver1, or other targets that have BMI2.)

Either way you're just going to write retval = 1ULL << (packed & 63) for decoding the first bit. But if you're wondering which compilers make nice code here, this is what you're looking for.

fastest way to convert two-bit number to low-memory representation

Tags:

c++

performance

micro-optimization

IceFire

1 Answers

Peter Cordes

Recent Activity

Donate For Us

fastest way to convert two-bit number to low-memory representation

Tags:

c++

performance

micro-optimization

IceFire

1 Answers

Peter Cordes

Related questions

Recent Activity

Donate For Us