I need to perform the following operation:
w[i] = scale * v[i] + point
scale and point are fixed, whereas v[] is a vector of 4-bit integers. I need to compute w[] for an arbitrary input vector v[], and I want to speed up the process using AVX intrinsics. However, the elements v[i] are 4-bit integers, which the intrinsics do not support directly.
The question is: how can I perform operations on 4-bit integers using intrinsics? I could widen everything to 8-bit integers and operate that way, but is there a way to do the following:
[a,b] + [c,d] = [a+c, b+d]
[a,b] * [c,d] = [a*c, b*d]
(ignoring overflow)
using AVX intrinsics, where [...,...] is an 8-bit integer packing two 4-bit integers, and a, b, c, d are 4-bit integers?
If yes, would it be possible to give a short example of how this could work?
Just a partial answer (addition only) and in pseudocode (it should be easy to extend to AVX2 intrinsics):
uint8_t a, b; // input containing two nibbles each
uint8_t c = a + b; // add with (unwanted) carry between nibbles
uint8_t x = a ^ b ^ c; // bits which are result of a carry
x &= 0x10; // only bit 4 is of interest
c -= x; // undo carry of lower to upper nibble
If either a or b is known to have bit 4 unset (i.e. the lowest bit of the upper nibble), it can be left out of the computation of x.
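To make this concrete, here is a minimal AVX2 sketch of the same addition, processing 32 bytes (64 nibbles) at a time; the function name add_nibbles_avx2 is only an illustrative choice, not from the original code:

#include <immintrin.h>

// Add two vectors of packed 4-bit integers (two nibbles per byte),
// cancelling the unwanted carry from the low into the high nibble.
static inline __m256i add_nibbles_avx2(__m256i a, __m256i b)
{
    __m256i sum   = _mm256_add_epi8(a, b);                          // add with carry between nibbles
    __m256i carry = _mm256_xor_si256(_mm256_xor_si256(a, b), sum);  // bits which are the result of a carry
    carry = _mm256_and_si256(carry, _mm256_set1_epi8(0x10));        // only bit 4 is of interest
    return _mm256_sub_epi8(sum, carry);                             // undo carry of lower to upper nibble
}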
As for multiplication: if scale is the same for all products, you can likely get away with some shifting and adding/subtracting (masking out overflow bits where necessary). Otherwise, I'm afraid you need to mask out 4 bits of each 16-bit word, do the operation, and fiddle the pieces together at the end. Pseudocode (there is no AVX 8-bit multiplication, so we need to operate on 16-bit words):
uint16_t m0=0xf, m1=0xf0, m2=0xf00, m3=0xf000; // masks for each nibble
uint16_t a, b; // input containing 4 nibbles each.
uint16_t p0 = (a*b) & m0; // lowest nibble, does not require masking a,b
uint16_t p1 = ((a>>4) * (b&m1)) & m1;
uint16_t p2 = ((a>>8) * (b&m2)) & m2;
uint16_t p3 = ((a>>12)* (b&m3)) & m3;
uint16_t result = p0 | p1 | p2 | p3; // join results together
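Assuming the data is held as 16-bit words (four nibbles each), the same scheme could look like this with AVX2 intrinsics; this is a sketch that follows the pseudocode above, and mul_nibbles_avx2 is an illustrative name:

#include <immintrin.h>

// Multiply packed 4-bit integers held in 16-bit words, nibble by nibble,
// using masks and 16-bit multiplications as in the pseudocode above.
static inline __m256i mul_nibbles_avx2(__m256i a, __m256i b)
{
    const __m256i m0 = _mm256_set1_epi16(0x000F);
    const __m256i m1 = _mm256_set1_epi16(0x00F0);
    const __m256i m2 = _mm256_set1_epi16(0x0F00);
    const __m256i m3 = _mm256_set1_epi16((short)0xF000);  // cast: 0xF000 does not fit a signed short
    __m256i p0 = _mm256_and_si256(_mm256_mullo_epi16(a, b), m0);  // lowest nibble, no masking of a, b needed
    __m256i p1 = _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(a, 4),  _mm256_and_si256(b, m1)), m1);
    __m256i p2 = _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(a, 8),  _mm256_and_si256(b, m2)), m2);
    __m256i p3 = _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(a, 12), _mm256_and_si256(b, m3)), m3);
    return _mm256_or_si256(_mm256_or_si256(p0, p1), _mm256_or_si256(p2, p3));  // join results together
}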
For fixed a, b in w[i] = v[i] * a + b, you can simply use a lookup table, e.g. w_0_3 = _mm_shuffle_epi8(LUT_03, input) for the low nibbles. Split the input into even and odd nibbles and use two LUTs, with the odd LUT preshifted left by 4.
auto a = input & 15; // per element
auto b = (input >> 4) & 15; // shift as 16 bits
return LUTA[a] | LUTB[b];
How to generate those LUTs dynamically, if that is needed at all, is another issue.
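As a rough AVX2 sketch of this LUT approach for the original w[i] = scale * v[i] + point (results truncated back to 4 bits, since the question allows overflow to be ignored), including the LUT generation; the names eval_nibbles_avx2, scale and point are illustrative, and in practice the LUTs would be built once outside the loop since scale and point are fixed:

#include <immintrin.h>
#include <stdint.h>

// Apply w = (scale * v + point) & 0xF to every packed nibble of 'input'
// via two 16-entry byte shuffles: one LUT for the low nibble and one
// pre-shifted LUT for the high nibble.
static inline __m256i eval_nibbles_avx2(__m256i input, int scale, int point)
{
    uint8_t lo[16], hi[16];
    for (int v = 0; v < 16; ++v) {
        uint8_t w = (uint8_t)((scale * v + point) & 0xF);
        lo[v] = w;                  // result for the even (low) nibble
        hi[v] = (uint8_t)(w << 4);  // same result, pre-shifted for the odd (high) nibble
    }
    __m256i luta = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)lo));
    __m256i lutb = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)hi));

    const __m256i m = _mm256_set1_epi8(0x0F);
    __m256i a = _mm256_and_si256(input, m);                        // even nibbles, per element
    __m256i b = _mm256_and_si256(_mm256_srli_epi16(input, 4), m);  // odd nibbles, shifted as 16 bits
    return _mm256_or_si256(_mm256_shuffle_epi8(luta, a),
                           _mm256_shuffle_epi8(lutb, b));
}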