I'm trying to convert some C code to an optimized version using NEON intrinsics. Here is the C code, which operates on two operands rather than on vectors of operands:
uint16_t mult_z216(uint16_t a, uint16_t b) {
    unsigned int c1 = (unsigned int)a * b; /* cast avoids signed overflow after int promotion */
    if (c1) {
        int c1h = c1 >> 16;
        int c1l = c1 & 0xffff;
        return (c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff;
    }
    return (1 - a - b) & 0xffff;
}
The SSE-optimized version of this operation has already been implemented as follows:
#define MULT_Z216_SSE(a, b, c) \
    t0  = _mm_or_si128((a), (b));          /* a | b, kept for the zero-product case   */ \
    (c) = _mm_mullo_epi16((a), (b));       /* low 16 bits of each 16-bit product      */ \
    (a) = _mm_mulhi_epu16((a), (b));       /* high 16 bits of each unsigned product   */ \
    (b) = _mm_subs_epu16((c), (a));        /* saturating lo - hi, clamps to 0         */ \
    (b) = _mm_cmpeq_epi16((b), C_0x0_XMM); /* 0xFFFF where lo <= hi, else 0           */ \
    (b) = _mm_srli_epi16((b), 15);         /* shift right by 15: mask becomes 0 or 1  */ \
    (c) = _mm_sub_epi16((c), (a));         /* wrapping lo - hi                        */ \
    (a) = _mm_cmpeq_epi16((c), C_0x0_XMM); /* 0xFFFF where lo == hi (a or b was 0)    */ \
    (c) = _mm_add_epi16((c), (b));         /* add the 0/1 carry from the comparison   */ \
    t0  = _mm_and_si128(t0, (a));          /* keep a | b only in the zero-product lanes */ \
    (c) = _mm_sub_epi16((c), t0);          /* those lanes become 1 - a - b            */
I've almost converted it using NEON intrinsics:
#define MULT_Z216_NEON(a, b, out) \
    temp = vorrq_u16(*a, *b);                    /* a | b                           */ \
    /* ?? NEON equivalent of _mm_mullo_epi16 */ \
    /* ?? NEON equivalent of _mm_mulhi_epu16 */ \
    *b   = vqsubq_u16(*out, *a);                 /* saturating, like _mm_subs_epu16 */ \
    *b   = vceqq_u16(*b, vdupq_n_u16(0x0000)); \
    *b   = vshrq_n_u16(*b, 15); \
    *out = vsubq_u16(*out, *a); \
    *a   = vceqq_u16(*out, vdupq_n_u16(0x0000)); \
    *out = vaddq_u16(*out, *b); \
    temp = vandq_u16(temp, *a); \
    *out = vsubq_u16(*out, temp);
I'm only missing the NEON equivalents of _mm_mullo_epi16((a), (b)) and _mm_mulhi_epu16((a), (b)). Either I'm misunderstanding something or there are no such intrinsics in NEON. If there is no equivalent, how can I achieve these steps using NEON intrinsics?
UPDATE:
I forgot to emphasize the following point: the operands of the function are uint16x8_t NEON vectors (each element is a uint16_t, i.e. an integer between 0 and 65535). In an answer someone proposed the intrinsic vqdmulhq_s16(). Using it would not match the given implementation, because that intrinsic interprets the vectors as signed values and would produce wrong output.
You can use:
uint32x4_t vmull_u16 (uint16x4_t, uint16x4_t)
which returns a vector of 32-bit products. If you want to break the result up into high and low parts, you can use the NEON unzip intrinsic.
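Concretely, one way to widen the two halves of each uint16x8_t with vmull_u16 and narrow back to separate low/high vectors is sketched below (untested; the helper name is mine, and vmovn_u32/vshrn_n_u32 are used here in place of the unzip, which would work equally well on the reinterpreted products):

```c
#include <arm_neon.h>

/* Sketch: per-lane low and high 16 bits of an unsigned 16x16 multiply,
   standing in for _mm_mullo_epi16 / _mm_mulhi_epu16. Untested. */
static inline void mul_lo_hi_u16x8(uint16x8_t a, uint16x8_t b,
                                   uint16x8_t *lo, uint16x8_t *hi) {
    /* Widening multiplies: four 32-bit products per half. */
    uint32x4_t p0 = vmull_u16(vget_low_u16(a),  vget_low_u16(b));
    uint32x4_t p1 = vmull_u16(vget_high_u16(a), vget_high_u16(b));
    /* Narrow keeping the low 16 bits of each product. */
    *lo = vcombine_u16(vmovn_u32(p0), vmovn_u32(p1));
    /* Shift-narrow keeping the high 16 bits of each product. */
    *hi = vcombine_u16(vshrn_n_u32(p0, 16), vshrn_n_u32(p1, 16));
}
```

Note that if only the low half is needed, vmulq_u16(a, b) already computes it directly.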
vmulq_s16() is the equivalent of _mm_mullo_epi16. There is no exact equivalent of _mm_mulhi_epu16; the closest instruction is vqdmulhq_s16(), which is a "saturating, doubling multiply returning the high part". It operates on signed 16-bit values only, and you would need to halve the input or output to cancel the doubling.