I am trying to extract 4 bytes out of a 128-bit register in an efficient way. The problem is that each value sits in a separate 32-bit lane: {120,0,0,0,55,0,0,0,42,0,0,0,120,0,0,0}. I want to transform the 128 bits to 32 bits in the form {120,55,42,120}.
The "raw" code looks like the following:
__m128i byte_result_vec={120,0,0,0,55,0,0,0,42,0,0,0,120,0,0,0};
unsigned char * byte_result_array=(unsigned char*)&byte_result_vec;
result_array[x]=byte_result_array[0];
result_array[x+1]=byte_result_array[4];
result_array[x+2]=byte_result_array[8];
result_array[x+3]=byte_result_array[12];
My SSSE3 code is:
unsigned int * byte_result_array=...;
__m128i byte_result_vec={120,0,0,0,55,0,0,0,42,0,0,0,120,0,0,0};
const __m128i eight_bit_shuffle_mask=_mm_set_epi8(1,1,1,1,1,1,1,1,1,1,1,1,0,4,8,12);
byte_result_vec=_mm_shuffle_epi8(byte_result_vec,eight_bit_shuffle_mask);
unsigned int * byte_result_array=(unsigned int*)&byte_result_vec;
result_array[x]=byte_result_array[0];
How can I do this efficiently with SSE2? Is there a better version with SSSE3 or SSE4?
You can look at a previous answer of mine for some solutions to this and the reverse operation.
In particular, with SSE2 you can do it by first packing the 32-bit integers into signed 16-bit integers using signed saturation:
byte_result_vec = _mm_packs_epi32(byte_result_vec, byte_result_vec);
Then we pack those 16-bit values into unsigned 8-bit values using unsigned saturation:
byte_result_vec = _mm_packus_epi16(byte_result_vec, byte_result_vec);
Finally, we can take our values from the lower 32 bits of the register:
int int_result = _mm_cvtsi128_si32(byte_result_vec);
unsigned char* byte_result_array = (unsigned char*)&int_result;
result_array[x] = byte_result_array[0];
result_array[x+1] = byte_result_array[1];
result_array[x+2] = byte_result_array[2];
result_array[x+3] = byte_result_array[3];
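Put together, the three steps above could be wrapped up roughly like this. It is only a sketch: the helper name pack_low_bytes_sse2 and the dst parameter are made up for illustration, and it assumes every 32-bit lane already holds a value in the range 0..255. It uses memcpy for the final store, so the four bytes are written at once instead of one by one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the question): narrows four 32-bit lanes,
   each assumed to hold a value in 0..255, and stores them to dst[0..3]. */
static void pack_low_bytes_sse2(__m128i v, uint8_t *dst)
{
    v = _mm_packs_epi32(v, v);           /* 32-bit -> 16-bit, signed saturation   */
    v = _mm_packus_epi16(v, v);          /* 16-bit -> 8-bit, unsigned saturation  */
    int packed = _mm_cvtsi128_si32(v);   /* low 32 bits now hold the four bytes   */
    memcpy(dst, &packed, sizeof packed); /* single 4-byte store                   */
}
```

In the question's loop this would be called as pack_low_bytes_sse2(byte_result_vec, &result_array[x]), assuming result_array is an array of unsigned char.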
EDIT: The above assumes that the 8-bit values are initially in the low bytes of their respective 32-bit words and that the rest is filled with 0s, since otherwise they will get clamped during the saturating packing process. Thus the operations are the following:
byte                  15                            0
                      0 0 0 D 0 0 0 C 0 0 0 B 0 0 0 A
_mm_packs_epi32   ->  0 D 0 C 0 B 0 A 0 D 0 C 0 B 0 A
_mm_packus_epi16  ->  D C B A D C B A D C B A D C B A
                                              ^^^^^^^
_mm_cvtsi128_si32 ->  int DCBA, laid out in x86 memory as bytes A B C D
                  ->  reinterpreted as unsigned char array { A, B, C, D }
If the uninteresting bytes are not filled with 0s initially, you have to mask them away beforehand:
byte_result_vec = _mm_and_si128(byte_result_vec, _mm_set1_epi32(0x000000FF));
Or, if the interesting bytes are initially in the high bytes, you have to shift them into the low bytes beforehand:
byte_result_vec = _mm_srli_epi32(byte_result_vec, 24);
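To show how these two pre-processing steps fit in front of the packing, here is a small sketch. All function names are hypothetical and chosen only for this example. The first variant keeps the low byte of every 32-bit lane regardless of what is in the upper three bytes, and the second one extracts the high byte of every lane.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical shared helper: expects every 32-bit lane to be in 0..255. */
static void narrow_and_store(__m128i v, uint8_t *dst)
{
    v = _mm_packs_epi32(v, v);           /* 32-bit -> 16-bit, signed saturation  */
    v = _mm_packus_epi16(v, v);          /* 16-bit -> 8-bit, unsigned saturation */
    int packed = _mm_cvtsi128_si32(v);
    memcpy(dst, &packed, sizeof packed);
}

/* Low byte of each lane; the upper three bytes may contain anything. */
static void pack_masked_low_bytes(__m128i v, uint8_t *dst)
{
    v = _mm_and_si128(v, _mm_set1_epi32(0x000000FF)); /* clear bits 8..31 */
    narrow_and_store(v, dst);
}

/* High byte of each lane; the logical shift moves it down and zero-fills. */
static void pack_high_bytes(__m128i v, uint8_t *dst)
{
    v = _mm_srli_epi32(v, 24);
    narrow_and_store(v, dst);
}
```

Without the mask or the shift, a lane such as 0x12345678 would saturate to 0x7FFF in the first pack and then to 0xFF in the second, which is why the extra instruction is needed in these cases.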
Or, if you actually want { D, C, B, A } (which is not completely clear to me from your question), then this amounts to just switching the array indices in the assignments (or, alternatively, performing a 32-bit shuffle (_mm_shuffle_epi32) on the initial SSE register beforehand).
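For the reversed order, the 32-bit shuffle variant might look like the following sketch. The function name is hypothetical, and the input lanes are again assumed to hold values in 0..255. _MM_SHUFFLE(0, 1, 2, 3) selects source lane 3 for destination lane 0, lane 2 for lane 1, and so on, so the lanes are reversed before they are packed.

```c
#include <xmmintrin.h>  /* _MM_SHUFFLE macro */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: stores the four lane values in reversed order,
   i.e. dst = { D, C, B, A } for input lanes A (lowest) to D (highest). */
static void pack_low_bytes_reversed(__m128i v, uint8_t *dst)
{
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3)); /* reverse the 32-bit lanes */
    v = _mm_packs_epi32(v, v);           /* 32-bit -> 16-bit, signed saturation  */
    v = _mm_packus_epi16(v, v);          /* 16-bit -> 8-bit, unsigned saturation */
    int packed = _mm_cvtsi128_si32(v);
    memcpy(dst, &packed, sizeof packed);
}
```

Swapping the array indices in the assignments costs nothing at run time, so the shuffle is mainly useful when the four bytes are written with a single 32-bit store as done here.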