The SSE shift instructions I have found can only shift by the same amount on all the elements:
_mm_sll_epi32()
_mm_slli_epi32()
These shift all elements, but by the same shift amount.
Is there a way to apply different shifts to the different elements? Something like this:
__m128i a, b;
r0:= a0 << b0;
r1:= a1 << b1;
r2:= a2 << b2;
r3:= a3 << b3;
There exists the _mm_shl_epi32()
intrinsic that does exactly that.
http://msdn.microsoft.com/en-us/library/gg445138.aspx
However, it requires the XOP instruction set, which is AMD-only. It was introduced with the Bulldozer family (Interlagos is the Bulldozer-based Opteron), and AMD dropped it again starting with Zen. It has never been available on any Intel processor.
Without XOP (and without AVX2, which later added _mm_sllv_epi32() for per-element shifts), you have to do it the hard way: pull the elements out and shift them one by one.
With SSE4.1 you can do that using the following intrinsics:
_mm_insert_epi32()
_mm_extract_epi32()
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse41_reg_ins_ext.htm
Those will let you extract parts of a 128-bit register into regular registers to do the shift and put them back.
If you go with the latter method, it'll be horrifically inefficient. That's why _mm_shl_epi32()
exists in the first place.
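The extract/shift/insert fallback described above could look roughly like this. It is only a sketch: the helper names and the USE_SSE41 macro are made up for illustration, the target attribute assumes GCC or Clang, and every shift count is assumed to be in the range 0..31.

```c
#include <stdint.h>
#include <smmintrin.h> /* SSE4.1: _mm_extract_epi32, _mm_insert_epi32 */

/* Assumption: GCC or Clang. The attribute enables SSE4.1 for a single
   function so the file also builds without -msse4.1. On other compilers
   the macro expands to nothing. */
#if defined(__GNUC__)
#define USE_SSE41 __attribute__((target("sse4.1")))
#else
#define USE_SSE41
#endif

/* Hypothetical helper: r[i] = a[i] << b[i] for the four 32-bit lanes.
   Counts are assumed to be 0..31; the mask only keeps the C shift defined. */
USE_SSE41
static __m128i shift_left_each_epi32(__m128i a, __m128i b)
{
    /* Pull each lane into a general-purpose register and shift it there. */
    uint32_t r0 = (uint32_t)_mm_extract_epi32(a, 0) << (_mm_extract_epi32(b, 0) & 31);
    uint32_t r1 = (uint32_t)_mm_extract_epi32(a, 1) << (_mm_extract_epi32(b, 1) & 31);
    uint32_t r2 = (uint32_t)_mm_extract_epi32(a, 2) << (_mm_extract_epi32(b, 2) & 31);
    uint32_t r3 = (uint32_t)_mm_extract_epi32(a, 3) << (_mm_extract_epi32(b, 3) & 31);

    /* Put the shifted lanes back into a vector register. */
    __m128i r = _mm_setzero_si128();
    r = _mm_insert_epi32(r, (int)r0, 0);
    r = _mm_insert_epi32(r, (int)r1, 1);
    r = _mm_insert_epi32(r, (int)r2, 2);
    r = _mm_insert_epi32(r, (int)r3, 3);
    return r;
}

/* Hypothetical array wrapper, handy for trying the helper on plain arrays. */
USE_SSE41
void shift_left_each_u32(const uint32_t in[4], const uint32_t counts[4],
                         uint32_t out[4])
{
    __m128i a = _mm_loadu_si128((const __m128i *)in);
    __m128i b = _mm_loadu_si128((const __m128i *)counts);
    _mm_storeu_si128((__m128i *)out, shift_left_each_epi32(a, b));
}
```

Calling shift_left_each_u32() with in = {1, 2, 3, 4} and counts = {0, 1, 2, 3} fills out with {1, 4, 12, 32}. The eight extracts and four inserts are exactly the overhead the answer warns about.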
Without XOP, your options are limited. If you can control the format of the shift count argument, then you can use _mm_mullo_epi16()
since multiplying by a power of two is the same as shifting left by that power.
For example, if you want to shift the 8 16-bit elements in an SSE register by <0, 1, 2, 3, 4, 5, 6, 7>
you can multiply by 2 raised to the shift counts, i.e., by <1, 2, 4, 8, 16, 32, 64, 128>.
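A minimal sketch of the multiply trick, assuming the caller already supplies the multipliers as powers of two (1 << count) rather than raw shift counts. The function name and the array-based interface are hypothetical and only there to keep the example self-contained. It needs nothing beyond SSE2.

```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_mullo_epi16, _mm_storeu_si128 */

/* Hypothetical helper: out[i] = in[i] << count[i] for eight 16-bit lanes,
   where pow2[i] must already hold (1 << count[i]), with count[i] in 0..15.
   _mm_mullo_epi16 keeps the low 16 bits of each product, which is exactly
   what a 16-bit left shift would produce. */
void shift_left_each_u16(const uint16_t in[8], const uint16_t pow2[8],
                         uint16_t out[8])
{
    __m128i a = _mm_loadu_si128((const __m128i *)in);
    __m128i m = _mm_loadu_si128((const __m128i *)pow2);
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi16(a, m));
}
```

With in set to eight copies of 3 and pow2 = {1, 2, 4, 8, 16, 32, 64, 128}, out becomes {3, 6, 12, 24, 48, 96, 192, 384}, i.e. 3 shifted left by 0 through 7. This only covers left shifts, and it only pays off when the multipliers are known in advance or are cheap to compute.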