SSE Instruction to load Bytes with Zero Extension?

Question

Let's say I have a pointer to an array of uint8_t values in RDI. I want to load 4 of them into XMM0 and use SIMD instructions to multiply them by the 4 float values stored in XMM1.

How can I load the 4 uint8_t values into XMM0 so that each one is "aligned" within its own 32-bit "compartment", meaning the lower 8 bits hold the uint8_t and the upper 24 bits are 0? Is there an instruction for that?

I hope my issue is understandable and I am sorry for my very naive explanation of my issue.

movdqu xmm0, [rdi]

would load a full 16 bytes (a double quadword) with no widening, which is not what I need.

Aki Suihkonen · Accepted Answer

With plain SSE2, you need to load the bytes and then widen them in two steps:

  // load 4 bytes (likely you can actually load 8 or 16 bytes for future savings)
  __m128i data = _mm_loadu_si32(input_stream);

  // interleave every byte with zero to cast to (u)int16_t
  data = _mm_unpacklo_epi8(data, _mm_setzero_si128());
  // interleave every word with zero to cast to (u)int32_t
  data = _mm_unpacklo_epi16(data, _mm_setzero_si128());
  // convert the integers to float
  __m128 fdata = _mm_cvtepi32_ps(data);
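Put together as a complete function, the SSE2 path might look like the sketch below. The function name scale4_u8_sse2 and its parameters are made up for illustration. It uses memcpy plus _mm_cvtsi32_si128 for the 4-byte load, on the assumption that some older compilers may not provide _mm_loadu_si32; the widening steps are the same as above.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: out[i] = (float)src[i] * scale[i] for i = 0..3. */
static void scale4_u8_sse2(const uint8_t *src, const float *scale, float *out)
{
    int32_t packed;
    memcpy(&packed, src, sizeof packed);      /* 4-byte load, no alignment assumed */
    __m128i v = _mm_cvtsi32_si128(packed);    /* bytes in the low dword, rest zeroed */
    const __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);           /* uint8  -> uint16 */
    v = _mm_unpacklo_epi16(v, zero);          /* uint16 -> uint32 */
    __m128 f = _mm_cvtepi32_ps(v);            /* int32  -> float  */
    f = _mm_mul_ps(f, _mm_loadu_ps(scale));   /* multiply by the 4 floats */
    _mm_storeu_ps(out, f);
}
```

Because the widened values are at most 255, treating them as signed int32 in _mm_cvtepi32_ps is safe.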

On SSE4.1, there's a single instruction (pmovzxbd) to expand uint8_t to uint32_t:

    __m128i data = _mm_cvtepu8_epi32(_mm_loadu_si32(input_stream));
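A self-contained version of the SSE4.1 path, as a rough sketch: the name scale4_u8_sse41 is hypothetical, and the target attribute is a GCC/Clang extension used here so that the function compiles without passing -msse4.1 for the whole file. Other compilers would need a different way to enable SSE4.1.

```c
#include <smmintrin.h>  /* SSE4.1 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: out[i] = (float)src[i] * scale[i] for i = 0..3.
   The target attribute (GCC/Clang only) enables SSE4.1 for this function. */
__attribute__((target("sse4.1")))
static void scale4_u8_sse41(const uint8_t *src, const float *scale, float *out)
{
    int32_t packed;
    memcpy(&packed, src, sizeof packed);                       /* 4-byte load */
    __m128i v = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(packed));  /* pmovzxbd: u8 -> u32 */
    __m128 f = _mm_mul_ps(_mm_cvtepi32_ps(v), _mm_loadu_ps(scale));
    _mm_storeu_ps(out, f);
}
```

In hand-written assembly the load can be folded into the instruction itself, as in pmovzxbd xmm0, dword [rdi]; a compiler may or may not do the same for the intrinsic version.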

On SSSE3, one can use pshufb, as in

    __m128i data = _mm_loadu_si32(input_stream);  // input_stream is a pointer, so load through it
    __m128i shuf = _mm_set_epi8(-1,-1,-1,3,-1,-1,-1,2,-1,-1,-1,1,-1,-1,-1,0);
    data = _mm_shuffle_epi8(data, shuf);
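The shuffle mask is easier to read once you remember that _mm_set_epi8 lists its arguments from the highest byte (15) down to the lowest (0), and that a mask byte with its top bit set (such as -1) produces a zero. A small sketch that only does the widening step, with a made-up function name and the GCC/Clang target attribute assumed:

```c
#include <tmmintrin.h>  /* SSSE3 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: out[i] = src[i] zero-extended to 32 bits, i = 0..3. */
__attribute__((target("ssse3")))
static void widen4_u8_ssse3(const uint8_t *src, uint32_t *out)
{
    int32_t packed;
    memcpy(&packed, src, sizeof packed);
    __m128i v = _mm_cvtsi32_si128(packed);
    /* Arguments run from byte 15 down to byte 0.
       Result byte 0 <- source byte 0, byte 4 <- 1, byte 8 <- 2, byte 12 <- 3;
       every -1 entry writes a zero byte. */
    const __m128i shuf = _mm_set_epi8(-1, -1, -1, 3,  -1, -1, -1, 2,
                                      -1, -1, -1, 1,  -1, -1, -1, 0);
    v = _mm_shuffle_epi8(v, shuf);
    _mm_storeu_si128((__m128i *)out, v);
}
```

Unlike the fixed widening instructions, the mask can pick any source bytes, so the same pattern works if the 4 bytes of interest are not adjacent in the register.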

If the next few operations include adding a constant (after multiplying by a constant), one may be able to use _mm_shuffle_epi8 to convert the original data directly into floating-point numbers of the form float_big + int_small.

Some examples are float_big_23 = 1<<23 and float_big_15 = 1<<15, where the bit patterns are 0x4b0000.. and 0x4700..00 (the .. marks the byte that receives the data). One needs a register that contains both the bytes of the float constant and the bytes from the stream, laid out as floatX .... d0 d1 d2 d3 d4 d5 d6 d7, for example by loading only the top 8 bytes of the register with _mm_loadh_pi. A suitable shuffle then generates the 4 floats floatX + d0, floatX + d1, floatX + d2, floatX + d3, from which the bias still has to be subtracted.

AFAIK, conversion by subtracting the magic value is not faster on any modern x64 than the direct int->float conversion. However, this approach lets one bake further offsets into the subtraction, possibly saving one sub/add instruction, at the cost of a possible penalty for mixing the integer and floating-point pipelines.
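A minimal sketch of the float_big_23 variant follows. It is a variation on the description above, not the exact layout: instead of _mm_loadh_pi it ORs the 0x4B byte into byte 15 of the register, so that only 4 bytes are read from the stream. The function name, the offset parameter, and the GCC/Clang target attribute are assumptions made for illustration.

```c
#include <tmmintrin.h>  /* SSSE3 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: out[i] = (float)src[i] + offset for i = 0..3,
   using the 2^23 magic-number trick instead of cvtdq2ps. */
__attribute__((target("ssse3")))
static void bias4_u8_magic(const uint8_t *src, float offset, float *out)
{
    int32_t packed;
    memcpy(&packed, src, sizeof packed);
    /* Low dword: the four data bytes. Byte 15: 0x4B, the top byte of
       8388608.0f (2^23, bit pattern 0x4B000000). */
    __m128i v = _mm_or_si128(_mm_cvtsi32_si128(packed),
                             _mm_set_epi32(0x4B000000, 0, 0, 0));
    /* Each 32-bit lane becomes 0x4B0000dd, which is the float 2^23 + dd. */
    const __m128i shuf = _mm_set_epi8(15, -1, -1, 3,  15, -1, -1, 2,
                                      15, -1, -1, 1,  15, -1, -1, 0);
    __m128 f = _mm_castsi128_ps(_mm_shuffle_epi8(v, shuf));
    /* Remove the bias and apply the extra offset in one subtraction. */
    f = _mm_sub_ps(f, _mm_set1_ps(8388608.0f - offset));
    _mm_storeu_ps(out, f);
}
```

The folded constant 8388608.0f - offset is itself rounded to float precision, so this only gives exact results for offsets that remain representable at that magnitude (for example multiples of 0.5 below 2^23).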

Tags: c, x86, assembly, x86-64, sse

Diana
