I need to perform the following operation:
w[i] = scale * v[i] + point
scale and point are fixed, whereas v[] is a vector of 4-bit integers. I need to compute w[] for an arbitrary input vector v[], and I want to speed up the process using AVX intrinsics. However, the elements v[i] are 4-bit integers, which the intrinsics do not support directly.
The question is: how can I perform operations on 4-bit integers using intrinsics? I could widen everything to 8-bit integers and operate that way, but is there a way to do the following:
[a,b] + [c,d] = [a+c, b+d]
[a,b] * [c,d] = [a*c, b*d]
(ignoring overflow)
using AVX intrinsics, where [...,...] is an 8-bit integer packing two 4-bit integers, and a, b, c, d are 4-bit integers?
If yes, would it be possible to give a short example of how this could work?
Just a partial answer (addition only) and in pseudocode (it should be easy to extend to AVX2 intrinsics):
uint8_t a, b; // input containing two nibbles each
uint8_t c = a + b; // add with (unwanted) carry between nibbles
uint8_t x = a ^ b ^ c; // bits which are result of a carry
x &= 0x10; // only bit 4 is of interest
c -= x; // undo carry of lower to upper nibble
If either a or b is known to have bit 4 unset (i.e. the lowest bit of the upper nibble), it can be left out of the computation of x.
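To make this concrete, here is a minimal AVX2 sketch of the same addition, processing 32 bytes (64 nibbles) at a time; the function name add_nibbles_avx2 is only an illustrative choice, not from the original code:

#include <immintrin.h>

// Add two vectors of packed 4-bit integers (two nibbles per byte),
// cancelling the unwanted carry from the low into the high nibble.
static inline __m256i add_nibbles_avx2(__m256i a, __m256i b)
{
    __m256i sum   = _mm256_add_epi8(a, b);                          // add with carry between nibbles
    __m256i carry = _mm256_xor_si256(_mm256_xor_si256(a, b), sum);  // bits which are the result of a carry
    carry = _mm256_and_si256(carry, _mm256_set1_epi8(0x10));        // only bit 4 is of interest
    return _mm256_sub_epi8(sum, carry);                             // undo carry of lower to upper nibble
}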
As for multiplication: if scale is the same for all products, you can likely get away with some shifting and adding/subtracting (masking out overflow bits where necessary). Otherwise, I'm afraid you need to mask out 4 bits of each 16-bit word, do the operation, and fiddle the pieces together at the end. Pseudocode (there is no AVX 8-bit multiplication, so we need to operate on 16-bit words):
uint16_t m0=0xf, m1=0xf0, m2=0xf00, m3=0xf000; // masks for each nibble
uint16_t a, b; // input containing 4 nibbles each.
uint16_t p0 = (a*b) & m0; // lowest nibble, does not require masking a,b
uint16_t p1 = ((a>>4) * (b&m1)) & m1;
uint16_t p2 = ((a>>8) * (b&m2)) & m2;
uint16_t p3 = ((a>>12)* (b&m3)) & m3;
uint16_t result = p0 | p1 | p2 | p3; // join results together
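Assuming the data is held as 16-bit words (four nibbles each), the same scheme could look like this with AVX2 intrinsics; this is a sketch that follows the pseudocode above, and mul_nibbles_avx2 is an illustrative name:

#include <immintrin.h>

// Multiply packed 4-bit integers held in 16-bit words, nibble by nibble,
// using masks and 16-bit multiplications as in the pseudocode above.
static inline __m256i mul_nibbles_avx2(__m256i a, __m256i b)
{
    const __m256i m0 = _mm256_set1_epi16(0x000F);
    const __m256i m1 = _mm256_set1_epi16(0x00F0);
    const __m256i m2 = _mm256_set1_epi16(0x0F00);
    const __m256i m3 = _mm256_set1_epi16((short)0xF000);  // cast: 0xF000 does not fit a signed short
    __m256i p0 = _mm256_and_si256(_mm256_mullo_epi16(a, b), m0);  // lowest nibble, no masking of a, b needed
    __m256i p1 = _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(a, 4),  _mm256_and_si256(b, m1)), m1);
    __m256i p2 = _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(a, 8),  _mm256_and_si256(b, m2)), m2);
    __m256i p3 = _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(a, 12), _mm256_and_si256(b, m3)), m3);
    return _mm256_or_si256(_mm256_or_si256(p0, p1), _mm256_or_si256(p2, p3));  // join results together
}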
For fixed a, b in w[i] = v[i] * a + b, you can simply use a lookup table, e.g. w_0_3 = _mm_shuffle_epi8(LUT_03, input) for the low nibbles. Split the input into even and odd nibbles and use two LUTs, with the odd LUT preshifted left by 4.
auto a = input & 15; // per element
auto b = (input >> 4) & 15; // shift as 16 bits
return LUTA[a] | LUTB[b];
How to generate those LUTs dynamically, if that is needed at all, is another issue.
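As a rough AVX2 sketch of this LUT approach for the original w[i] = scale * v[i] + point (results truncated back to 4 bits, since the question allows overflow to be ignored), including the LUT generation; the names eval_nibbles_avx2, scale and point are illustrative, and in practice the LUTs would be built once outside the loop since scale and point are fixed:

#include <immintrin.h>
#include <stdint.h>

// Apply w = (scale * v + point) & 0xF to every packed nibble of 'input'
// via two 16-entry byte shuffles: one LUT for the low nibble and one
// pre-shifted LUT for the high nibble.
static inline __m256i eval_nibbles_avx2(__m256i input, int scale, int point)
{
    uint8_t lo[16], hi[16];
    for (int v = 0; v < 16; ++v) {
        uint8_t w = (uint8_t)((scale * v + point) & 0xF);
        lo[v] = w;                  // result for the even (low) nibble
        hi[v] = (uint8_t)(w << 4);  // same result, pre-shifted for the odd (high) nibble
    }
    __m256i luta = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)lo));
    __m256i lutb = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)hi));

    const __m256i m = _mm256_set1_epi8(0x0F);
    __m256i a = _mm256_and_si256(input, m);                        // even nibbles, per element
    __m256i b = _mm256_and_si256(_mm256_srli_epi16(input, 4), m);  // odd nibbles, shifted as 16 bits
    return _mm256_or_si256(_mm256_shuffle_epi8(luta, a),
                           _mm256_shuffle_epi8(lutb, b));
}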