I have four 2-bit bitfields stored in a single byte. Each bitfield can thus represent 0, 1, 2, or 3. For example, here are the 4 possible values where the first 3 bitfields are zero:
```
00 00 00 00 = 0 0 0 0
00 00 00 01 = 0 0 0 1
00 00 00 10 = 0 0 0 2
00 00 00 11 = 0 0 0 3
```
I'd like an efficient way to sum the four bitfields. For example:
```
11 10 01 00 = 3 + 2 + 1 + 0 = 6
```
An 8-bit lookup table on a modern Intel x64 CPU takes 4 cycles to return an answer from L1. It seems like there should be some way to calculate the answer faster than that. 3 cycles gives room for about 6-12 simple bit operations. As a starter, the straightforward mask and shift looks like it will take 5 cycles on Sandy Bridge:
Assuming the bitfields are `d c b a`, and that `mask` is `00 00 00 11`:
Clarifying with help from Ira: this presumes that `a`, `b`, `c`, and `d` are identical and have all been set to the initial `byte`. Oddly, I think I can do this for free. Since I can do 2 loads per cycle, instead of loading `byte` once, I can just load it four times: `a` and `d` on the first cycle, `b` and `c` on the second. The second two loads will be delayed one cycle, but I don't need them until the second cycle. The splits below represent how things should break into separate cycles.
```
a = *byte
d = *byte

b = *byte
c = *byte

latency

latency

a &= mask
d >>= 6

b >>= 2
c >>= 4
a += d

b &= mask
c &= mask

b += c

a += b
```
A different encoding for the bitfields to make the logic easier would actually be fine, so long as it fits within a single byte and maps one-to-one with this scheme somehow. Dropping to assembly is also fine. Current target is Sandy Bridge, but targeting Haswell or beyond is fine too.
Application and motivation: I'm trying to make an open-source variable-bit decompression routine go faster. Each bitfield represents the compressed length of each of the four integers that follow. I need the sum to know how many bytes I have to jump to get to the next group of four. The current loop takes 10 cycles, with 5 of that coming from the lookup I'm trying to avoid. Shaving off a cycle would be a ~10% improvement.
Originally I said "8 cycles", but as Evgeny points out below, I was wrong: the only time there is a 4-cycle load is when loading from the first 2K of system memory without using an index register. A correct list of latencies can be found in the Intel Architecture Optimization Manual, Section 2.12:
> | Data Type | (Base + Offset) > 2048; Base + Index [+ Offset] | (Base + Offset) < 2048 |
> |---|---|---|
> | Integer | 5 cycles | 4 cycles |
> | MMX, SSE, 128-bit AVX | 6 cycles | 5 cycles |
> | X87 | 7 cycles | 6 cycles |
> | 256-bit AVX | 7 cycles | 7 cycles |
I think this is how Ira's solution below would break into cycles. I think it also takes 5 cycles of work post load.
```
a = *byte
b = *byte

latency

latency

latency

a &= 0x33
b >>= 2

b &= 0x33
c = a

a += b
c += b

a &= 7
c >>= 4

a += c
```
Would a builtin POPCOUNT instruction help?
```
n  = POPCOUNT(byte & 0x55);
n += 2 * POPCOUNT(byte & 0xAA);
```
Or maybe
```
word = byte + ((byte & 0xAA) << 8);
n = POPCOUNT(word);
```
Not sure about the total timing. This discussion says POPCNT has 3-cycle latency and 1-cycle throughput.
UPDATE:
I may be missing some important fact about how to run IACA, but after a few experiments in the 12-11 throughput range, I compiled the following:
```c
uint32_t decodeFast(uint8_t *in, size_t count) {
    uint64_t key1 = *in;
    uint64_t key2;
    size_t adv;
    while (count--) {
        IACA_START;
        key2 = key1 & 0xAA;
        in += __builtin_popcount(key1);
        adv = __builtin_popcount(key2);
        in += adv + 4;
        key1 = *in;
    }
    IACA_END;
    return key1;
}
```
with `gcc -std=c99 -msse4 -m64 -O3 test.c`, and got 3.55 cycles!?!:
```
Block Throughput: 3.55 Cycles    Throughput Bottleneck: InterIteration

| Uops | 0 - DV | 1   | 2 - D   | 3 - D   | 4 | 5   |    |
-----------------------------------------------------------------------
|  1   |        | 1.0 |         |         |   |     |    | popcnt edx,eax
|  1   | 0.9    |     |         |         |   | 0.1 | CP | and eax,0x55
|  1   |        | 1.0 |         |         |   |     | CP | popcnt eax,eax
|  1   | 0.8    |     |         |         |   | 0.2 |    | movsxd rdx,edx
|  1   | 0.6    |     |         |         |   | 0.4 |    | add rdi, rdx
|  1   | 0.1    | 0.1 |         |         |   | 0.9 | CP | cdqe
|  1   | 0.2    | 0.3 |         |         |   | 0.6 |    | sub rsi, 1
|  1   | 0.2    | 0.8 |         |         |   |     | CP | lea rdi,[rdi+rax+4]
|  1   |        |     | 0.5 0.5 | 0.5 0.5 |   |     | CP | movzx eax,[rdi]
|  1   |        |     |         |         |   | 1.0 |    | jnz 0xffff
```
Two more ideas
A possible micro-optimization to do the sum in 2 instructions:
```
total = 0;
PDEP(vals, 0x03030303, *in);  # expands the niblets (2-bit fields) into bytes
PSADBW(total, vals)           # total := sum of abs(0 - byte) for each byte in vals
```
Latency of each is supposedly 3, so this may not help. Maybe the byte-wise summation could be replaced with simple shifts and adds along the lines of `AX = total + (total >> 16); ADD AL, AH`.
A macro-optimization:
You mention using the key as a lookup into a table of shuffle instructions. Why not just store the distance to the next key along with the shuffle instruction? Either store a bigger table, or possibly squeeze the 4-bit length into unused bits 3-6 of the shuffle key, at the expense of needing a mask to extract it.