Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AVX 4-bit integers

I need to perform the following operation:

 w[i] = scale * v[i] + point

scale and point are fixed, whereas v[] is a vector of 4-bit integers.

I need to compute w[] for the arbitrary input vector v[] and I want to speed up the process using AVX intrinsics. However, v[i] is a vector of 4-bit integers.

The question is how to perform operations on 4-bit integers using intrinsics? I could use 8-bit integers and perform operations that way, but is there a way to do the following:

[a,b] + [c,d] = [a+b,c+d]

[a,b] * [c,d] = [a * b,c * d]

(Ignoring overflow)

Using AVX intrinsics, where [...,...] Is an 8-bit integer and a,b,c,d are 4-bit integers?

If yes, would it be possible to give a short example on how this could work?

like image 359
user227837 Avatar asked May 16 '17 20:05

user227837


2 Answers

Just a partial answer (only addition) and in pseudo code (should be easy to extent to AVX2 intrinsics):

uint8_t a, b;          // input containing two nibbles each

uint8_t c = a + b;     // add with (unwanted) carry between nibbles
uint8_t x = a ^ b ^ c; // bits which are result of a carry
x &= 0x10;             // only bit 4 is of interest
c -= x;                // undo carry of lower to upper nibble

If either a or b is known to have bit 4 unset (i.e. the lowest bit of the upper nibble), it can be left out the computation of x.

As for multiplication: If scale is the same for all products, you can likely get away with some shifting and adding/subtracting (masking out overflow bits where necessarry). Otherwise, I'm afraid you need to mask out 4 bits of each 16bit word, do the operation, and fiddle them together at the end. Pseudo code (there is no AVX 8bit multiplication, so we need to operate with 16bit words):

uint16_t m0=0xf, m1=0xf0, m2=0xf00, m3=0xf000; // masks for each nibble

uint16_t a, b; // input containing 4 nibbles each.

uint16_t p0 = (a*b) & m0; // lowest nibble, does not require masking a,b
uint16_t p1 = ((a>>4) * (b&m1)) & m1;
uint16_t p2 = ((a>>8) * (b&m2)) & m2;
uint16_t p3 = ((a>>12)* (b&m3)) & m3;

uint16_t result = p0 | p1 | p2 | p3;  // join results together 
like image 191
chtz Avatar answered Oct 22 '22 13:10

chtz


For fixed a, b in w[i]=v[i] * a + b, you can simply use a lookup table w_0_3 = _mm_shuffle_epi8(LUT_03, input) for the LSB. Split the input to even and odd nibbles, with the odd LUT preshifted by 4.

auto a = input & 15; // per element
auto b = (input >> 4) & 15; // shift as 16 bits
return LUTA[a] | LUTB[b];

How to generate those LUTs dynamically, is another issue, if at all.

like image 2
Aki Suihkonen Avatar answered Oct 22 '22 12:10

Aki Suihkonen