I am trying to convert code written with SSE3 intrinsics to NEON SIMD and am stuck on a shuffle function. I have looked at the GCC intrinsics, the ARM manuals, and other forums, but have not been able to find a solution.
CODE:
__m128i upper = _mm_loadu_si128((__m128i*)p1);
register __m128i mask1 = _mm_set_epi8 (0x80,0x80,0x80,0x80,0x80,0x80,0x80,12,0x80,10,0x80,7,0x80,4,0x80,1);
register __m128i mask2 = _mm_set_epi8 (0x80,0x80,0x80,0x80,0x80,0x80,12,0x80,10,0x80,7,0x80,4,0x80,1,0x80);
__m128i temp1_upper = _mm_or_si128(_mm_shuffle_epi8(upper,mask1),_mm_shuffle_epi8(upper,mask2));
Although vtbl1_u8(uint8x8_t, uint8x8_t) performs a table lookup that can be used to assign values to a destination register, it only operates on 64-bit registers. Also, the shuffle operation starts with a comparison on each mask byte, which would have to be reproduced in NEON, and I do not know how to do that efficiently:
r0 = (mask0 & 0x80) ? 0 : SELECT(a, mask0 & 0x0f) // SELECT(a,n) extracts nth 8-bit parameter from a.
r1 = (mask1 & 0x80) ? 0 : SELECT(a, mask1 & 0x0f)
...
I cannot find an instruction that first checks the high bit of each mask byte and then selects using the lower 4 bits of the mask. I know that we could test the high bit of each byte in the register and then use the lower 4 bits only where it is clear, but I was hoping for something more efficient. I hope someone can help or provide a reference.
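For reference, the per-byte rule in the pseudocode above can be written as a scalar C model. This is only a sketch for checking a NEON port against; the name shuffle_epi8_ref is made up for illustration, and the mask bytes are given in memory order (byte 0 first), which is the reverse of the argument order of _mm_set_epi8.

```c
#include <stdint.h>

/* Scalar model of the byte shuffle described above (hypothetical helper,
   not part of any library). For each output byte: if the high bit of the
   mask byte is set the result is 0, otherwise the low 4 bits of the mask
   byte pick a byte of src. dst must not alias src. */
static void shuffle_epi8_ref(const uint8_t src[16], const uint8_t mask[16],
                             uint8_t dst[16])
{
    for (int i = 0; i < 16; ++i) {
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0f];
    }
}
```

With mask1 from the code above written in memory order, that is {1, 0x80, 4, 0x80, 7, 0x80, 10, 0x80, 12, 0x80, 0x80, ...}, bytes 0, 2, 4, 6 and 8 of the result receive source bytes 1, 4, 7, 10 and 12, and every other byte is zero.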
Thanks a lot,
Cheers!
VTBL returns 0 when the index is out of range.
Since it supports up to two Q registers as the lookup table, it would be quite simple :
That will do the trick.
If you want bits 4 to 6 of each index to be ignored, as they are on x86, you can mask them out before the vtbl.
Unfortunately, VBIC with an immediate operand is no help here, because its immediate form does not work on 8-bit elements.
Therefore, you have to sacrifice a register to hold the bit mask.
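A rough sketch of that masking step, assuming the indices are ANDed with 0x8f so that bit 7 still pushes an index out of range while bits 4 to 6 are dropped. The helper names (clamp_indices, vtbl16_model, vtbl16_masked_model) are made up for illustration. The NEON part is only compiled on ARM; the scalar model underneath is just there to check the logic anywhere.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: clear bits 4-6 of every index but keep bit 7.
   The 0x8f constant lives in a register, as noted above. */
static inline uint8x16_t clamp_indices(uint8x16_t idx)
{
    return vandq_u8(idx, vdupq_n_u8(0x8f));
}
#endif

/* Scalar model of a lookup into a 16-byte table with VTBL semantics:
   any index outside 0..15 reads as 0. */
static uint8_t vtbl16_model(const uint8_t table[16], uint8_t idx)
{
    return idx < 16 ? table[idx] : 0;
}

/* The same lookup with bits 4-6 of the index cleared first. */
static uint8_t vtbl16_masked_model(const uint8_t table[16], uint8_t idx)
{
    return vtbl16_model(table, (uint8_t)(idx & 0x8f));
}
```

Without the mask, an index such as 0x11 is out of range and gives 0; with it, the same index selects byte 1, which is what the x86 shuffle would do. If all your indices are either 0..15 or have bit 7 set, as in the question, the mask is not needed at all.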
You just need to use vtbl2_u8 twice, splitting the input and joining the output appropriately:
#define uint8x16_to_8x8x2(v) ((uint8x8x2_t) { vget_low_u8(v), vget_high_u8(v) })
uint8x16_t a = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff };
uint8x16_t b = { 0x80, 0x0f, 0x01, 0x0e, 0x02, 0x0d, 0x03, 0x0c, 0x04, 0x0b, 0x05, 0x0a, 0x06, 0x09, 0x07, 0x08 };
uint8x16_t c = vcombine_u8(vtbl2_u8(uint8x16_to_8x8x2(a), vget_low_u8(b)), vtbl2_u8(uint8x16_to_8x8x2(a), vget_high_u8(b)));
// c = 00 ff 11 ee 22 dd 33 cc 44 bb 55 aa 66 99 77 88
As Jake said, vtbl returns 0 whenever the index is out of range, so you shouldn't need any special handling for the 0x80 case.
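Applied to the code in the question, this could look roughly like the sketch below. The names shuffle_bytes and merged_idx are made up for illustration, and the non-NEON branch is only a scalar stand-in so the logic can be checked on any machine. Since mask1 and mask2 never select in the same byte position, ORing the two shuffles gives the same result as one shuffle with the two masks merged, so a single index vector is enough here.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: 16-byte shuffle built from two vtbl2_u8 lookups.
   Indices outside 0..15 (such as 0x80) produce 0. */
static void shuffle_bytes(const uint8_t src[16], const uint8_t idx[16],
                          uint8_t dst[16])
{
    uint8x16_t a = vld1q_u8(src);
    uint8x16_t b = vld1q_u8(idx);
    uint8x8x2_t table;
    table.val[0] = vget_low_u8(a);   /* table bytes 0..7  */
    table.val[1] = vget_high_u8(a);  /* table bytes 8..15 */
    vst1q_u8(dst, vcombine_u8(vtbl2_u8(table, vget_low_u8(b)),
                              vtbl2_u8(table, vget_high_u8(b))));
}
#else
/* Scalar stand-in with the same out-of-range behaviour. */
static void shuffle_bytes(const uint8_t src[16], const uint8_t idx[16],
                          uint8_t dst[16])
{
    for (int i = 0; i < 16; ++i) {
        dst[i] = idx[i] < 16 ? src[idx[i]] : 0;
    }
}
#endif

/* mask1 | mask2 from the question, merged and written in memory order
   (byte 0 first, the reverse of the _mm_set_epi8 argument order). */
static const uint8_t merged_idx[16] = {
    1, 1, 4, 4, 7, 7, 10, 10, 12, 12, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
};
```

Calling shuffle_bytes(p1, merged_idx, out) would then play the role of temp1_upper: each of source bytes 1, 4, 7, 10 and 12 is written to two adjacent output bytes and the remaining six bytes are zero.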