
Using values from `__m256i` to access an array efficiently - SIMD [closed]

Let's say, for example, that I have two __m256i variables called rows and cols, holding these values:

rows: 0, 2, 7, 5, 7, 2, 3, 0
cols: 1, 2, 7, 5, 7, 2, 2, 6

Now, these values represent the x and y positions for 8 points, so, in this case I would have these points:

p0: [0, 1], p1: [2, 2], p2: [7, 7], p3: [5, 5]
p4: [7, 7], p5: [2, 2], p6: [3, 2], p7: [0, 6]

I also have an array called lut that will have values of int type:

lut: [0, 1, 2, 3, ..., 60, 61, 62, 63]

What I want to do is use the position values in rows and cols to index into the lut array and build a new __m256i value from the values read.

The only way I know to do that is to store the rows and cols values in two int arrays of size 8, read the values from the lut array one at a time, and then use _mm256_set_epi32() to create the new __m256i value.

This works, but it seems very inefficient to me. So my question is whether there is a faster way to do it.

Note that these values are just for a more concrete example, and lut doesn't need to have ordered values or size 64.

thanks!

E. B. asked Aug 07 '17 20:08



1 Answer

You can build a solution using an AVX2 gather instruction, like so:

// index = (rows << 3) + cols;  (the shift by 3 assumes 8 entries per row, as in the 64-element example)
const __m256i index = _mm256_add_epi32( _mm256_slli_epi32(rows, 3), cols);
// result = lut[index];
const __m256i result = _mm256_i32gather_epi32(lut, index, 4);

Be aware that on current CPUs gather instructions have fairly high latency, so unless you can interleave other independent instructions before actually using result, this may not be worth it.
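For reference, here is a minimal, self-contained sketch that wraps the gather in a function next to the one-at-a-time lookup from the question. The function names (lut_lookup8, lut_lookup8_avx2, lut_lookup8_scalar) and the pointer-based interface are hypothetical and chosen only for illustration. It also assumes GCC or Clang on x86, since it relies on the target attribute and __builtin_cpu_supports, and it assumes a row-major table with 8 ints per row.

```c
#include <immintrin.h>

/* Scalar reference: the one-at-a-time lookup described in the question.
   Assumption: lut is row-major with 8 ints per row. */
static void lut_lookup8_scalar(const int *lut, const int rows[8],
                               const int cols[8], int out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = lut[rows[i] * 8 + cols[i]];
}

/* AVX2 path. The target attribute (a GCC/Clang extension) allows AVX2
   intrinsics here even if the file is not compiled with -mavx2. */
__attribute__((target("avx2")))
static void lut_lookup8_avx2(const int *lut, const int rows[8],
                             const int cols[8], int out[8])
{
    const __m256i r = _mm256_loadu_si256((const __m256i *)rows);
    const __m256i c = _mm256_loadu_si256((const __m256i *)cols);
    /* index = (rows << 3) + cols */
    const __m256i index = _mm256_add_epi32(_mm256_slli_epi32(r, 3), c);
    /* scale = 4 because each table entry is a 4-byte int */
    const __m256i result = _mm256_i32gather_epi32(lut, index, 4);
    _mm256_storeu_si256((__m256i *)out, result);
}

/* Hypothetical entry point: use the gather when the CPU supports AVX2,
   otherwise fall back to the scalar loop. */
void lut_lookup8(const int *lut, const int rows[8],
                 const int cols[8], int out[8])
{
    if (__builtin_cpu_supports("avx2"))
        lut_lookup8_avx2(lut, rows, cols, out);
    else
        lut_lookup8_scalar(lut, rows, cols, out);
}
```

With lut[i] = i for i in 0..63 and the rows and cols values from the question, both paths fill out with 1, 18, 63, 45, 63, 18, 26, 6. Whether the gather actually beats the scalar loop depends on the CPU and on the surrounding code, so it is worth measuring both.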

To explain the factor of 4: the scale argument (which must be 1, 2, 4 or 8) in

__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

is a multiplier that turns each index into a byte offset, i.e., the value returned for each index is:

*(const int*)((const char*) base_addr + scale*index)

I don't know whether there are many use cases for this behavior. Perhaps it is meant to make it possible to access a LUT with 1-byte or 2-byte entries (which would require some masking afterwards). Or perhaps it was simply allowed because scaling by 4 is possible, whereas scaling by 1/4 or 1/2 would not be (in case someone really needed that).
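To make the byte-offset rule concrete, here is a small scalar model of a single gather lane in plain C. It is only a sketch of the addressing formula above, not of the instruction's performance or fault behavior. The helper names (gather_lane_model, lookup_int_lut, lookup_byte_lut) are hypothetical, and the byte-table variant assumes a little-endian machine and a table with at least 3 readable padding bytes after the last entry.

```c
#include <string.h>

/* Scalar model of one lane of _mm256_i32gather_epi32: read a 4-byte int
   located scale*index BYTES past base_addr. memcpy keeps the possibly
   unaligned read well defined. */
static int gather_lane_model(const void *base_addr, int index, int scale)
{
    int value;
    memcpy(&value, (const char *)base_addr + (long)scale * index, sizeof value);
    return value;
}

/* Table of 4-byte ints: scale = 4, so index i selects lut[i]. */
int lookup_int_lut(const int *lut, int index)
{
    return gather_lane_model(lut, index, 4);
}

/* Table of 1-byte entries: scale = 1, then mask off the three extra bytes.
   Assumptions: little-endian, and 3 readable bytes follow the last entry. */
int lookup_byte_lut(const unsigned char *lut, int index)
{
    return gather_lane_model(lut, index, 1) & 0xFF;
}
```

With scale = 4 the model reduces to an ordinary lut[index] on an int array. With scale = 1 each lane still loads 4 bytes, which is why the masking mentioned above is needed for a byte table.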

chtz answered Nov 02 '22 22:11
