I have a big pixel processing function which I am currently trying to optimize using intrinsic functions.
Being an SSE novice, I am not sure how to tackle the part of the code which involves lookup tables.
Basically, I am trying to vectorize the following vanilla C++ code:
//outside loop
const float LUT_RATIO = 1000.0F;
//in loop
float v = ... //input value
v = myLookupTable[static_cast<int>(v * LUT_RATIO)];
What I'm trying:
//outside loop
const __m128 LUT_RATIO = _mm_set1_ps(1000.0F);
//in loop
__m128 v = _mm_set_ps(v1, v2, v3, v4); //input values
__m128i vI = _mm_cvtps_epi32(_mm_mul_ps(v, LUT_RATIO)); //multiply and convert to integers
v = ??? // how to get vI indices of myLookupTable?
edit: ildjarn makes a point that demands clarification on my part. I am not trying to achieve speedup for the lookup table code, I am simply trying to avoid having to store the registers back to floats specifically for doing the lookup, as this part is sandwiched between 2 other parts which could theoretically benefit from SSE.
If you can wait until next year then Intel's Haswell CPUs will have AVX2 which includes instructions for gathered loads. This enables you to do e.g. 8 parallel LUT lookups in one instruction (see e.g. VGATHERDPS
). Other than that, you're out of luck, unless your LUTs are quite small (e.g. 16 elements), in which case you can use PSHUFB
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With