Let's say I have a pointer to a bunch of uint8_t's in RDI and I want to load 4 uint8_ts into XMM0 and use SIMD instructions to multiply it with XMM1 where I have 4 float values stored.
How can I load the initial 4 uint8_ts into XMM0 so that they are always "aligned", meaning that each 32-bit "compartment" holds one uint8_t in its lower 8 bits and the upper 24 bits are 0? Is there an instruction for that?
I hope my issue is understandable and I am sorry for my very naive explanation of my issue.
movdqu xmm0, [rdi]
would load 16 contiguous bytes packed next to each other, which is not what I need.
With plain SSE2, you need to:
// load 4 bytes (likely you can actually load 8 or 16 bytes for future savings)
__m128i data = _mm_loadu_si32(input_stream);
// interleave every byte with zero to cast to (u)int16_t
data = _mm_unpacklo_epi8(data, _mm_setzero_si128());
// interleave every word with zero to cast to (u)int32_t
data = _mm_unpacklo_epi16(data, _mm_setzero_si128());
// convert the integers to float
__m128 fdata = _mm_cvtepi32_ps(data);
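Putting the SSE2 steps together with the multiply from the question, a minimal sketch could look like the following. The helper name scale4_u8_sse2 and the factors parameter (standing in for XMM1) are made up for illustration. It loads the 4 bytes with memcpy plus _mm_cvtsi32_si128 rather than _mm_loadu_si32, on the assumption that some older compilers do not ship the latter; both should compile to a single movd.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen 4 bytes at src to floats and multiply them
 * lane by lane with the 4 floats in factors. */
static __m128 scale4_u8_sse2(const uint8_t *src, __m128 factors)
{
    int32_t raw;
    memcpy(&raw, src, sizeof raw);           /* unaligned 4-byte load */
    __m128i v = _mm_cvtsi32_si128(raw);      /* bytes in the low lane, rest zero */
    const __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);          /* uint8_t  -> uint16_t */
    v = _mm_unpacklo_epi16(v, zero);         /* uint16_t -> uint32_t */
    return _mm_mul_ps(_mm_cvtepi32_ps(v), factors);
}
```

The result can be written back with _mm_storeu_ps. If more than 4 bytes are available, loading 8 or 16 at once and unpacking both halves amortizes the load.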
With SSE4.1, a single instruction (pmovzxbd) zero-extends uint8_t to uint32_t:
__m128i data = _mm_cvtepu8_epi32(_mm_loadu_si32(input_stream));
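For completeness, here is a sketch of the SSE4.1 path including the conversion and the multiply. The helper name scale4_u8_sse41, the factors parameter, and the TARGET_SSE41 macro are assumptions for illustration. The target attribute is a GCC/Clang feature that lets this one function use SSE4.1 without building the whole file with -msse4.1; with other compilers the macro expands to nothing.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>
#include <string.h>

/* Illustrative macro: enable SSE4.1 for a single function on GCC/Clang. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: widen 4 bytes at src to floats and multiply them
 * lane by lane with the 4 floats in factors. */
TARGET_SSE41
static __m128 scale4_u8_sse41(const uint8_t *src, __m128 factors)
{
    int32_t raw;
    memcpy(&raw, src, sizeof raw);                          /* unaligned 4-byte load */
    __m128i v = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(raw));  /* pmovzxbd */
    return _mm_mul_ps(_mm_cvtepi32_ps(v), factors);
}
```

Code that has to run on CPUs without SSE4.1 would still need a runtime check before calling this.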
With SSSE3, one can use pshufb, as in
__m128i data = _mm_loadu_si32(input_stream);
__m128i shuf = _mm_set_epi8(-1,-1,-1,3,-1,-1,-1,2,-1,-1,-1,1,-1,-1,-1,0);
data = _mm_shuffle_epi8(data, shuf);
If the next few operations include adding a constant (after multiplying by a constant), then one might be able to use _mm_shuffle_epi8 to turn the original bytes directly into floating-point numbers of the form float_big + int_small.
Some examples are float_big_23 = 1<<23 and float_big_15 = 1<<15, whose bit patterns are 0x4B000000 and 0x47000000. One needs a register that holds both the bytes of the float constant and the bytes from the stream, laid out as floatX .... d0 d1 d2 d3 d4 d5 d6 d7, for example by loading the stream into only the top 8 bytes of a register that already holds floatX, using _mm_loadh_pi. A suitable shuffle then produces the 4 floats floatX + d0, floatX + d1, floatX + d2, floatX + d3, from which the bias still has to be subtracted. AFAIK, converting by subtracting the magic value is not faster than the direct int->float conversion on any modern x64 CPU, but the subtraction can absorb further offsets, possibly saving one sub/add instruction, at the risk of a penalty for mixing the integer and floating-point pipelines.
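A minimal sketch of the bias trick, under a few assumptions: instead of the pshufb layout described above, it builds the 0x4B0000dd patterns with two SSE2 unpacks (the second one interleaves with the constant 0x4B00 rather than zero), which is easier to show in isolation. The helper name u8_to_float_biased and the offset parameter are made up. The offset is assumed to be a small integer, so that 2^23 - offset is exactly representable as a float.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns (float)src[i] + offset for 4 bytes, without
 * an int->float conversion instruction. Assumes offset is a small integer. */
static __m128 u8_to_float_biased(const uint8_t *src, float offset)
{
    int32_t raw;
    memcpy(&raw, src, sizeof raw);                       /* unaligned 4-byte load */
    __m128i v = _mm_cvtsi32_si128(raw);
    v = _mm_unpacklo_epi8(v, _mm_setzero_si128());       /* 0x00dd per 16-bit lane */
    v = _mm_unpacklo_epi16(v, _mm_set1_epi16(0x4B00));   /* 0x4B0000dd per 32-bit lane */
    /* Each lane now reads as the float 2^23 + d. Subtracting 2^23 - offset
     * removes the bias and adds the extra offset in the same instruction. */
    const __m128 magic = _mm_set1_ps(8388608.0f - offset);
    return _mm_sub_ps(_mm_castsi128_ps(v), magic);
}
```

With offset = 0 this is a plain uint8_t to float conversion. With offset = -128 it recenters the bytes around zero at no extra cost.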