Let's say, for example, that I have two __m256i variables called rows and cols, and the values inside them are:
rows: 0, 2, 7, 5, 7, 2, 3, 0
cols: 1, 2, 7, 5, 7, 2, 2, 6
Now, these values represent the x and y positions of 8 points, so in this case I would have these points:
p0: [0, 1], p1: [2, 2], p2: [7, 7], p3: [5, 5]
p4: [7, 7], p5: [2, 2], p6: [3, 2], p7: [0, 6]
I also have an array called lut that holds values of type int:
lut: [0, 1, 2, 3, ..., 60, 61, 62, 63]
What I want to do is use the position values from rows and cols to index into the lut array and create a new __m256i value from the entries that were looked up.
The only way I know to do that is to store the rows and cols values in two int arrays of size 8, read the values from the lut array one at a time, and then use _mm256_set_epi32() to create the new __m256i value.
This works, but it seems very inefficient to me. So my question is whether there is a faster way to do it.
Note that these values are just a concrete example; lut doesn't need to hold ordered values or have size 64.
thanks!
You can build a solution using an AVX2 gather instruction, like so (the shift by 3 assumes lut is laid out row-major with 8 entries per row, as in your example):
// index = (rows << 3) + cols;
const __m256i index = _mm256_add_epi32( _mm256_slli_epi32(rows, 3), cols);
// result = lut[index];
const __m256i result = _mm256_i32gather_epi32(lut, index, 4);
Be aware that on current CPUs gather instructions have quite high latency, so unless you can interleave some other instructions before actually using result, this may not be worth using.
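For reference, here is a minimal self-contained sketch of the snippet above next to the one-at-a-time version from the question. The function names (lut_lookup8, lut_lookup8_avx2, lut_lookup8_scalar) are made up for illustration, the table is assumed to be row-major with 8 columns, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions on x86, so treat this as one possible arrangement rather than the only way to do it.

```c
#include <immintrin.h>

/* Hypothetical helper: gather 8 entries of a row-major LUT with 8 columns.
   Compiled for AVX2 via a GCC/Clang function attribute (an assumption;
   building the whole file with -mavx2 would work as well). */
__attribute__((target("avx2")))
static void lut_lookup8_avx2(const int *lut, const int rows[8],
                             const int cols[8], int out[8])
{
    const __m256i r = _mm256_loadu_si256((const __m256i *)rows);
    const __m256i c = _mm256_loadu_si256((const __m256i *)cols);
    /* index = (rows << 3) + cols */
    const __m256i index = _mm256_add_epi32(_mm256_slli_epi32(r, 3), c);
    /* result = lut[index]; scale 4 = sizeof(int) bytes per entry */
    const __m256i result = _mm256_i32gather_epi32(lut, index, 4);
    _mm256_storeu_si256((__m256i *)out, result);
}

/* Scalar baseline: the one-element-at-a-time lookup from the question. */
static void lut_lookup8_scalar(const int *lut, const int rows[8],
                               const int cols[8], int out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = lut[rows[i] * 8 + cols[i]];
}

/* Use the gather version only when the CPU reports AVX2 support. */
static void lut_lookup8(const int *lut, const int rows[8],
                        const int cols[8], int out[8])
{
    if (__builtin_cpu_supports("avx2"))
        lut_lookup8_avx2(lut, rows, cols, out);
    else
        lut_lookup8_scalar(lut, rows, cols, out);
}
```

With the values from the question and lut[i] = i, both paths should produce 1, 18, 63, 45, 63, 18, 26, 6. Timing the two versions on your own data is the only reliable way to tell whether the gather pays off.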
To explain the factor of 4: the scale factor in
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
is applied as a byte multiplier (it must be 1, 2, 4 or 8), i.e., the value loaded for each index is:
*(const int*)((const char*) base_addr + scale*index)
I don't know whether there are many use cases for that behavior. Perhaps it is meant to make it possible to access a LUT with 1-byte or 2-byte entries (which requires some masking afterwards). Or perhaps byte scaling was simply chosen because scaling by 4 is always possible, whereas an element-based scale would have needed factors of 1/4 or 1/2 to reach byte offsets (in case someone really needed that).
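To make the byte-offset behavior concrete, here is a scalar model of a single gather lane, plus the byte-LUT trick mentioned above. This is only a sketch: gather_lane and byte_lut_lookup are hypothetical names, memcpy stands in for the pointer cast so that unaligned reads stay well-defined in C, and the masking assumes a little-endian machine (which x86 is).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one lane: load 32 bits from base_addr + scale*index bytes. */
static int32_t gather_lane(const void *base_addr, int32_t index, int scale)
{
    /* The real intrinsic only accepts these scale values. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    int32_t value;
    memcpy(&value, (const char *)base_addr + (ptrdiff_t)scale * index,
           sizeof value);
    return value;
}

/* Byte LUT lookup with scale 1: the load still reads 4 bytes, so only the
   low byte is kept. The table needs 3 readable padding bytes after its
   last entry, because the load at the final index runs past it. */
static int byte_lut_lookup(const uint8_t *padded_lut, int32_t index)
{
    return gather_lane(padded_lut, index, 1) & 0xFF;
}
```

For an int table, gather_lane(lut, i, 4) returns lut[i], which is why the answer passes 4. With scale 1 the same load starts at byte i, which is what makes the masked byte lookup work.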