
How do I sum the four 2-bit bitfields in a single 8-bit byte?

I have four 2-bit bitfields stored in a single byte. Each bitfield can thus represent 0, 1, 2, or 3. For example, here are the 4 possible values where the first 3 bitfields are zero:

00 00 00 00 = 0 0 0 0
00 00 00 01 = 0 0 0 1
00 00 00 10 = 0 0 0 2
00 00 00 11 = 0 0 0 3

I'd like an efficient way to sum the four bitfields. For example:

11 10 01 00 = 3 + 2 + 1 + 0 = 6

An 8-bit lookup table on a modern Intel x64 CPU takes 4 cycles to return an answer from L1. It seems like there should be some way to calculate the answer faster than that. Three cycles gives room for about 6-12 simple bit operations. As a starter, the straightforward mask and shift looks like it will take 5 cycles on Sandy Bridge:

Assuming the bit fields are: d c b a, and that mask is: 00 00 00 11

Clarifying with help from Ira: This presumes that a, b, c, and d are identical, all having been set to the initial byte. Oddly, I think I can do this for free. Since I can do 2 loads per cycle, instead of loading the byte once, I can just load it four times: a and d on the first cycle, b and c on the second. The second two loads will be delayed one cycle, but I don't need them until the second cycle. The splits below represent how things should break into separate cycles.

a = *byte
d = *byte

b = *byte
c = *byte

latency

latency

a &= mask
d >>= 6

b >>= 2
c >>= 4
a += d

b &= mask
c &= mask

b += c

a += b
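
For reference, the same straightforward mask-and-shift approach written as plain C (a minimal sketch; the function name is just illustrative):

    #include <stdint.h>

    /* Sum the four 2-bit fields with the straightforward mask and shift. */
    static inline uint32_t sum_fields_naive(uint8_t byte) {
        uint32_t a = byte        & 0x3;   /* field a (bits 0-1) */
        uint32_t b = (byte >> 2) & 0x3;   /* field b (bits 2-3) */
        uint32_t c = (byte >> 4) & 0x3;   /* field c (bits 4-5) */
        uint32_t d = (byte >> 6);         /* field d (bits 6-7), no mask needed */
        return (a + d) + (b + c);         /* pair the adds as in the schedule above */
    }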

A different encoding for the bitfields to make the logic easier would actually be fine, so long as it fits within a single byte and maps one-to-one with this scheme somehow. Dropping to assembly is also fine. Current target is Sandy Bridge, but targeting Haswell or beyond is fine too.

Application and motivation: I'm trying to make an open-source variable-bit decompression routine go faster. Each bitfield represents the compressed length of each of the four integers that follow. I need the sum to know how many bytes I need to jump to get to the next group of four. The current loop takes 10 cycles, with 5 of that coming from the lookup I'm trying to avoid. Shaving off a cycle would be a ~10% improvement.

Edit:

Originally I said "8 cycles", but as Evgeny points out below I was wrong: the only time a load takes just 4 cycles is when loading from the first 2K of system memory without using an index register. A correct list of latencies can be found in the Intel Architecture Optimization Manual, Section 2.12:

>     Data Type               Base + Index [+ Offset]      (Base + Offset) < 2048
>                             (Base + Offset) > 2048
>     Integer                        5 cycles                    4 cycles
>     MMX, SSE, 128-bit AVX          6 cycles                    5 cycles
>     X87                            7 cycles                    6 cycles
>     256-bit AVX                    7 cycles                    7 cycles

Edit:

I think this is how Ira's solution below would break into cycles. I think it also takes 5 cycles of work post-load.

a = *byte
b = *byte

latency

latency 

latency

a &= 0x33
b >>= 2

b &= 0x33
c = a

a += b
c += b

a &= 7
c >>= 4

a += c 
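
In plain C, that SWAR-style version might look like this (a sketch of the same idea; the duplicated add in the cycle breakdown above is only there for instruction-level parallelism, and the function name is illustrative):

    #include <stdint.h>

    /* Add fields a+b and c+d in parallel within one register, then combine. */
    static inline uint32_t sum_fields_swar(uint8_t byte) {
        uint32_t a = byte & 0x33;          /* fields a and c (bits 0-1 and 4-5)  */
        uint32_t b = (byte >> 2) & 0x33;   /* fields b and d, aligned below them */
        uint32_t s = a + b;                /* low nibble: a+b, high nibble: c+d  */
        return (s & 7) + (s >> 4);         /* each partial sum is at most 6      */
    }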
asked Jul 26 '13 by Nathan Kurz


1 Answer

Would a builtin POPCOUNT instruction help?

n  = POPCOUNT(byte & 0x55);
n += 2 * POPCOUNT(byte & 0xAA);

Or maybe

  word = byte + ((byte&0xAA) << 8);
  n = POPCOUNT(word);

Not sure about the total timing. This discussion says popcount has a latency of 3 cycles and a throughput of 1.
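
For completeness, a self-contained sketch of the popcount idea in C, using GCC/Clang's __builtin_popcount (with -msse4 or -mpopcnt it should compile to a single POPCNT each; the function name is illustrative):

    #include <stdint.h>

    /* sum of fields = popcount(low bits) + 2 * popcount(high bits) */
    static inline uint32_t sum_fields_popcnt(uint8_t byte) {
        return (uint32_t)__builtin_popcount(byte & 0x55)        /* low bit of each field  */
             + 2u * (uint32_t)__builtin_popcount(byte & 0xAA);  /* high bit of each field */
    }

For example, sum_fields_popcnt(0xE4) /* 11 10 01 00 */ returns 6.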


UPDATE:
I may be missing some important fact about how to run IACA, but after a few experiments in the 12-11 throughput range, I compiled the following:

 #include <stdint.h>
 #include <stddef.h>
 #include "iacaMarks.h"   /* IACA_START / IACA_END markers */

 uint32_t decodeFast(uint8_t *in, size_t count) {
  uint64_t key1 = *in;
  uint64_t key2;
  size_t adv;
  while (count--) {
     IACA_START;
     key2 = key1 & 0xAA;                /* high bit of each field          */
     in  += __builtin_popcount(key1);   /* low bits + high bits            */
     adv  = __builtin_popcount(key2);   /* high bits counted a second time */
     in  += adv + 4;                    /* advance = sum of fields + 4     */
     key1 = *in;
  }
  IACA_END;
  return key1;
 }

with gcc -std=c99 -msse4 -m64 -O3 test.c

and got 3.55 cycles !?!:

Block Throughput: 3.55 Cycles       Throughput Bottleneck: InterIteration
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    |           | 1.0 |           |           |     |     |    | popcnt edx,eax
|   1    | 0.9       |     |           |           |     | 0.1 | CP | and eax,0x55 
|   1    |           | 1.0 |           |           |     |     | CP | popcnt eax,eax
|   1    | 0.8       |     |           |           |     | 0.2 |    | movsxd rdx,edx
|   1    | 0.6       |     |           |           |     | 0.4 |    | add rdi, rdx
|   1    | 0.1       | 0.1 |           |           |     | 0.9 | CP | cdqe 
|   1    | 0.2       | 0.3 |           |           |     | 0.6 |    | sub rsi, 1
|   1    | 0.2       | 0.8 |           |           |     |     | CP | lea rdi,[rdi+rax+4] 
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     | CP | movzx eax,[rdi]
|   1    |           |     |           |           |     | 1.0 |    | jnz 0xffff

Two more ideas

A possible micro-optimization to do the sum in 2 instructions:

total = 0;
PDEP(vals, 0x03030303, *in);   # expand each 2-bit field into its own byte
PSADBW(total, vals)            # total := sum of abs(byte - 0) for each byte in vals

Latency of each is supposedly 3 cycles, so this may not help. Maybe the byte-wise summation could be replaced with simple shifts and adds along the lines of AX = total + (total >> 16); ADD AL, AH.
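
If targeting Haswell or later is acceptable (PDEP requires BMI2), an intrinsics sketch of that idea might look like the following; _pdep_u32 spreads each 2-bit field into its own byte and PSADBW against zero does the horizontal byte sum (the function name is illustrative):

    #include <stdint.h>
    #include <immintrin.h>   /* _pdep_u32 (BMI2), _mm_sad_epu8 (SSE2) */

    static inline uint32_t sum_fields_pdep(uint8_t key) {
        uint32_t spread = _pdep_u32(key, 0x03030303u);          /* one 2-bit field per byte */
        __m128i  v      = _mm_cvtsi32_si128((int)spread);
        __m128i  s      = _mm_sad_epu8(v, _mm_setzero_si128()); /* sum of the four bytes    */
        return (uint32_t)_mm_cvtsi128_si32(s);
    }

Compile with -mbmi2 (SSE2 is already implied on x86-64).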

A macro-optimization:
You mention using the key as a lookup into a table of shuffle instructions. Why not just store the distance to the next key along with the shuffle instruction? Either store a bigger table, or possibly squeeze the 4-bit length into unused bits 3-6 of the shuffle key, at the expense of needing a mask to extract it.
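
For the bigger-table variant, a hypothetical entry layout could simply pair the pshufb mask with the precomputed advance (names and sizes are illustrative, not taken from the original code):

    #include <stdint.h>

    /* One entry per possible key byte (256 entries). */
    typedef struct {
        uint8_t shuffle[16];   /* pshufb control mask for the four integers            */
        uint8_t advance;       /* bytes to skip to reach the next key/group            */
    } DecodeEntry;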

answered Oct 18 '22 by AShelly