
Neon equivalent to SSE intrinsics

I'm trying to convert some C code into an optimized version using NEON intrinsics.

Here is the C code, which operates on two scalar operands rather than on vectors of operands.

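uint16_t mult_z216(uint16_t a, uint16_t b){
    /* Cast first: uint16_t promotes to int, and 65535*65535 would overflow a signed int. */
    unsigned int c1 = (unsigned int)a * b;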
    if(c1)
    {
        int c1h = c1 >> 16;
        int c1l = c1 & 0xffff;
        return (c1l - c1h + ((c1l<c1h)?1:0)) & 0xffff;
    }
    return (1-a-b) & 0xffff;
}

The SSE-optimized version of this operation has already been implemented as follows:

#define MULT_Z216_SSE(a, b, c) \
    t0  = _mm_or_si128 ((a), (b));          /* t0 = a | b (bitwise OR of the 128-bit values) */ \
    (c) = _mm_mullo_epi16 ((a), (b));       /* c = low 16 bits of each 16-bit product */ \
    (a) = _mm_mulhi_epu16 ((a), (b));       /* a = high 16 bits of each unsigned 16-bit product */ \
    (b) = _mm_subs_epu16((c), (a));         /* b = c - a, unsigned and saturating (0 where c <= a) */ \
    (b) = _mm_cmpeq_epi16 ((b), C_0x0_XMM); /* b = 0xFFFF where b == 0, else 0x0 */ \
    (b) = _mm_srli_epi16 ((b), 15);         /* logical shift right by 15 bits: b becomes 1 or 0 */ \
    (c) = _mm_sub_epi16 ((c), (a));         /* c = c - a (wrapping) */ \
    (a) = _mm_cmpeq_epi16 ((c), C_0x0_XMM); /* a = 0xFFFF where c == 0, else 0x0 */ \
    (c) = _mm_add_epi16 ((c), (b));         /* c = c + b (wrapping) */ \
    t0  = _mm_and_si128 (t0, (a));          /* keep the original a | b only where c was 0 */ \
    (c) = _mm_sub_epi16 ((c), t0);          /* c = c - t0 (wrapping) */

I've almost converted this to NEON intrinsics:

#define MULT_Z216_NEON(a, b, out) \
    temp = vorrq_u16 (*a, *b); \
    /* ?? missing: *out = low 16 bits of the products (_mm_mullo_epi16) */ \
    /* ?? missing: *a = high 16 bits of the products (_mm_mulhi_epu16) */ \
    *b = vqsubq_u16(*out, *a); \
    *b = vceqq_u16(*b, vdupq_n_u16(0x0000)); \
    *b = vshrq_n_u16(*b, 15); \
    *out = vsubq_u16(*out, *a); \
    *a = vceqq_u16(*out, vdupq_n_u16(0x0000)); \
    *out = vaddq_u16(*out, *b); \
    temp = vandq_u16(temp, *a); \
    *out = vsubq_u16(*out, temp);

I'm only missing the NEON equivalents of _mm_mullo_epi16((a), (b)) and _mm_mulhi_epu16((a), (b)). Either I'm misunderstanding something or NEON has no such intrinsics. If there is no equivalent, how can these steps be achieved with NEON intrinsics?

UPDATE :

I forgot to emphasize the following point: the operands of the function are uint16x8_t NEON vectors (each element is a uint16_t, i.e. an integer between 0 and 65535). One answer proposed the intrinsic vqdmulhq_s16(). It does not match the given implementation, because it interprets the vector elements as signed values and therefore produces wrong output.
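To make the mismatch concrete, here is a minimal scalar sketch in C. The helper names mulhi_u16 and qdmulh_s16_model are made up for illustration. The second one is only a model of a single lane of vqdmulhq_s16, based on its documented description (signed, doubling, saturating, return high half), not the intrinsic itself.

```c
#include <stdint.h>

/* Scalar model of one lane of _mm_mulhi_epu16: unsigned high half. */
static uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}

/* Assumed scalar model of one lane of vqdmulhq_s16:
   treat both inputs as signed, double the product, saturate to 32 bits,
   and return the high 16 bits as a bit pattern. */
static uint16_t qdmulh_s16_model(uint16_t a, uint16_t b)
{
    /* Portable sign extension of the 16-bit patterns. */
    int32_t sa = (int32_t)(a ^ 0x8000) - 0x8000;
    int32_t sb = (int32_t)(b ^ 0x8000) - 0x8000;
    int64_t p = 2 * (int64_t)sa * (int64_t)sb;
    if (p > INT32_MAX)
        p = INT32_MAX;              /* only reached for (-32768) * (-32768) */
    /* Go through uint32_t so the shift is well defined for negative p. */
    return (uint16_t)(((uint32_t)(int32_t)p) >> 16);
}
```

With a = 0x8000 and b = 2, mulhi_u16 gives 1 while the signed doubling model gives 0xFFFE. Even for small positive inputs such as 0x4000 and 4, the doubling makes the model return 2 where the unsigned high half is 1.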

Kami asked Jul 02 '12 11:07


2 Answers

You can use:

uint32x4_t vmull_u16 (uint16x4_t, uint16x4_t) 

This returns a vector of four 32-bit products. If you want to split the results into high and low 16-bit parts, you can reinterpret the products as 16-bit lanes and use the NEON unzip intrinsic (vuzpq_u16).
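A minimal sketch of that idea, assuming a little-endian target so that the low half of each 32-bit product lands in the even 16-bit lane. The function name mul_lo_hi_u16x8 and the array-based interface are hypothetical, chosen only to keep the example self-contained. The scalar branch is there so the snippet also compiles on machines without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: for 8 lanes, lo[i] and hi[i] receive the low and
   high 16 bits of the unsigned product a[i] * b[i]. */
void mul_lo_hi_u16x8(const uint16_t a[8], const uint16_t b[8],
                     uint16_t lo[8], uint16_t hi[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t va = vld1q_u16(a);
    uint16x8_t vb = vld1q_u16(b);

    /* Two widening multiplies, four 32-bit products each. */
    uint32x4_t p0 = vmull_u16(vget_low_u16(va),  vget_low_u16(vb));
    uint32x4_t p1 = vmull_u16(vget_high_u16(va), vget_high_u16(vb));

    /* View every product as two 16-bit lanes and de-interleave them:
       val[0] collects the even lanes (low halves, little-endian assumed),
       val[1] collects the odd lanes (high halves). */
    uint16x8x2_t z = vuzpq_u16(vreinterpretq_u16_u32(p0),
                               vreinterpretq_u16_u32(p1));
    vst1q_u16(lo, z.val[0]);
    vst1q_u16(hi, z.val[1]);
#else
    /* Scalar fallback with the same per-lane result. */
    for (int i = 0; i < 8; i++) {
        uint32_t p = (uint32_t)a[i] * (uint32_t)b[i];
        lo[i] = (uint16_t)(p & 0xffff);
        hi[i] = (uint16_t)(p >> 16);
    }
#endif
}
```

Here z.val[0] plays the role of _mm_mullo_epi16 and z.val[1] the role of _mm_mulhi_epu16. In real code you would probably keep everything in registers instead of storing to arrays.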

Guy Sirton answered Nov 16 '22 21:11


vmulq_s16() is the equivalent of _mm_mullo_epi16. There is no exact equivalent of _mm_mulhi_epu16; the closest instruction is vqdmulhq_s16() which is "saturating, doubling, multiply, return high part". It operates on signed 16-bit values only and you will need to divide the input or output by 2 to nullify the doubling.
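Since the question's update says the lanes are unsigned, one alternative that avoids the signed doubling multiply is to take the low half with vmulq_u16 and build the high half from a widening vmull_u16 followed by a narrowing shift, vshrn_n_u32(..., 16). The following is only a sketch under those assumptions. The name mult_z216_x8 and the array interface are hypothetical, the remaining steps mirror the SSE macro from the question, and the scalar branch exists so the snippet compiles without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar reference from the question, with the product kept unsigned. */
static uint16_t mult_z216(uint16_t a, uint16_t b)
{
    uint32_t c1 = (uint32_t)a * b;
    if (c1) {
        uint32_t c1h = c1 >> 16, c1l = c1 & 0xffff;
        return (uint16_t)((c1l - c1h + (c1l < c1h ? 1 : 0)) & 0xffff);
    }
    return (uint16_t)((1 - a - b) & 0xffff);
}

/* Hypothetical name: eight lanes of mult_z216 at once. */
void mult_z216_x8(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t va = vld1q_u16(a), vb = vld1q_u16(b);
    uint16x8_t zero = vdupq_n_u16(0);
    uint16x8_t t0 = vorrq_u16(va, vb);

    /* Low 16 bits of each product (role of _mm_mullo_epi16). */
    uint16x8_t lo = vmulq_u16(va, vb);
    /* High 16 bits: widen, multiply, shift right by 16 while narrowing
       (role of _mm_mulhi_epu16). */
    uint16x8_t hi = vcombine_u16(
        vshrn_n_u32(vmull_u16(vget_low_u16(va),  vget_low_u16(vb)),  16),
        vshrn_n_u32(vmull_u16(vget_high_u16(va), vget_high_u16(vb)), 16));

    /* 1 where lo <= hi (saturating subtract is 0), else 0. */
    uint16x8_t carry = vshrq_n_u16(vceqq_u16(vqsubq_u16(lo, hi), zero), 15);
    uint16x8_t c = vsubq_u16(lo, hi);
    uint16x8_t is_zero = vceqq_u16(c, zero);   /* 0xFFFF where lo == hi */
    c = vaddq_u16(c, carry);
    c = vsubq_u16(c, vandq_u16(t0, is_zero));  /* zero-product case: 1 - (a | b) */
    vst1q_u16(out, c);
#else
    /* Scalar fallback: apply the reference lane by lane. */
    for (int i = 0; i < 8; i++)
        out[i] = mult_z216(a[i], b[i]);
#endif
}
```

Comparing mult_z216_x8 against the scalar mult_z216 for a handful of inputs, including lanes where a or b is 0, is a cheap way to check a port like this.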

BitBank answered Nov 16 '22 22:11