
Fastest way to test a 128 bit NEON register for a value of 0 using intrinsics?

Tags: neon

I'm looking for the fastest way to test whether a 128-bit NEON register contains all zeros, using NEON intrinsics. I'm currently using 3 OR operations and 2 MOVs:

uint32x4_t vr = vorrq_u32(vcmp0, vcmp1);

uint64x2_t v0 = vreinterpretq_u64_u32(vr);
uint64x1_t v0or = vorr_u64(vget_high_u64(v0), vget_low_u64(v0));

uint32x2_t v1 = vreinterpret_u32_u64 (v0or);
uint32_t r = vget_lane_u32(v1, 0) | vget_lane_u32(v1, 1);

if (r == 0) { /* do stuff */ }

gcc translates this into the following assembly code:

VORR     q9, q9, q10
VORR     d16, d18, d19
VMOV.32  r3, d16[0]
VMOV.32  r2, d16[1]
VORRS    r2, r2, r3
BEQ      ...

Does anyone have an idea of a faster way?

miluz asked Mar 13 '13 15:03


4 Answers

While this answer may be a bit late, there is a simple way to do the test with only 3 instructions and no extra registers:

inline uint32_t is_not_zero(uint32x4_t v)
{
    uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
}

The return value will be nonzero if any bit in the 128-bit NEON register was set.
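
To see why OR-ing the two halves and then taking a pairwise maximum is enough, the reduction can be modelled in plain C. This is only a sketch: `pairwise_max_model` and `is_not_zero_any` are hypothetical helper names, not part of the answer, and the intrinsic path is compiled only when `__ARM_NEON` is defined.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar model (hypothetical helper): lanes[0..1] form the low half,
   lanes[2..3] the high half of the 128-bit register. */
static inline uint32_t pairwise_max_model(const uint32_t lanes[4])
{
    uint32_t t0 = lanes[0] | lanes[2];  /* lane 0 of vorr_u32(low, high) */
    uint32_t t1 = lanes[1] | lanes[3];  /* lane 1 of vorr_u32(low, high) */
    return t0 > t1 ? t0 : t1;           /* lane 0 of vpmax_u32(tmp, tmp) */
}

/* Uses the intrinsics when NEON is available, otherwise the scalar model. */
static inline uint32_t is_not_zero_any(const uint32_t lanes[4])
{
#if defined(__ARM_NEON)
    uint32x4_t v = vld1q_u32(lanes);
    uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
#else
    return pairwise_max_model(lanes);
#endif
}
```

The maximum of two unsigned values is zero only when both are zero, so the result is zero exactly when all four lanes are zero. The nonzero value itself carries no further meaning and should only be compared against 0.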

Henri Ylitie answered Nov 12 '22 13:11


If you're targeting AArch64 NEON, you can use the following to get a value to test with just two instructions:

inline uint64_t is_not_zero(uint32x4_t v)
{
    uint64x2_t v64 = vreinterpretq_u64_u32(v);
    uint32x2_t v32 = vqmovn_u64(v64);
    uint64x1_t result = vreinterpret_u64_u32(v32);
    return vget_lane_u64(result, 0);
}
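
The saturating narrow is what makes this safe: a plain truncating narrow (`vmovn_u64`) would discard the upper 32 bits of each 64-bit lane. A scalar sketch of the difference follows; the helper names are hypothetical and only model one lane of each instruction.

```c
#include <stdint.h>

/* Models one lane of vqmovn_u64: unsigned saturating narrow, 64 -> 32 bits. */
static inline uint32_t sat_narrow_u64(uint64_t x)
{
    return x > UINT32_MAX ? UINT32_MAX : (uint32_t)x;
}

/* Models one lane of vmovn_u64: plain truncation, keeps only the low 32 bits. */
static inline uint32_t trunc_narrow_u64(uint64_t x)
{
    return (uint32_t)x;
}

/* Models the whole test: narrow both 64-bit halves and pack them into
   a single 64-bit value that can be compared against zero. */
static inline uint64_t qmovn_test_model(uint64_t lo, uint64_t hi)
{
    return (uint64_t)sat_narrow_u64(lo) | ((uint64_t)sat_narrow_u64(hi) << 32);
}
```

A lane holding `1ULL << 32` truncates to 0 but saturates to `UINT32_MAX`, so only the saturating form maps every nonzero 64-bit lane to a nonzero 32-bit lane.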
jtlim answered Nov 12 '22 14:11


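You asked for intrinsics; here is one way. Note that the comparison and the casts rely on the GCC/Clang vector extensions:

inline bool is_zero(int32x4_t v) noexcept
{
  // each lane becomes all-ones if it is nonzero, 0 otherwise
  v = v != int32x4_t{};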

  return !int32x2_t(
    vtbl2_s8(
      int8x8x2_t{
        int8x8_t(vget_low_s32(v)),
        int8x8_t(vget_high_s32(v))
      },
      int8x8_t{0, 4, 8, 12}
    )
  )[0];
}
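
The compare-then-gather idea can be modelled in scalar C. This is a sketch under two assumptions: the per-lane mask marks nonzero lanes, and the layout is little-endian, so table indices 0, 4, 8 and 12 select byte 0 of each 32-bit lane. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-lane compare: all-ones when the lane is nonzero, 0 otherwise. */
static inline uint32_t lane_mask_nonzero(uint32_t lane)
{
    return lane != 0 ? 0xFFFFFFFFu : 0u;
}

/* Gathers byte 0 of each lane's mask into one 32-bit word, as the table
   lookup with indices 0, 4, 8, 12 would, then tests that word. */
static inline bool is_zero_gather_model(const uint32_t lanes[4])
{
    uint32_t word = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t mask = lane_mask_nonzero(lanes[i]);
        word |= (mask & 0xFFu) << (8 * i);  /* byte 0 of lane i -> byte i */
    }
    return word == 0;
}
```

The compare step matters because gathering byte 0 of the raw lanes would miss a value such as `0x100`, whose low byte is zero.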

Nils Pipenbrinck's answer has a flaw: it assumes that the QC (cumulative saturation) flag is clear.

user1095108 answered Nov 12 '22 14:11


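If you have AArch64 you can do it even more easily, because it adds across-lanes reduction instructions that suit this test. Use the unsigned maximum rather than the add (vaddvq_u32): the sum of the lanes can wrap around to zero even when some lanes are nonzero.

inline uint32_t is_not_zero(uint32x4_t v)
{
    return vmaxvq_u32(v);
}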
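
The difference between an across-lanes add and an across-lanes maximum as a zero test can be sketched in scalar C. The helpers below are hypothetical models of `vaddvq_u32` and `vmaxvq_u32`, not the intrinsics themselves.

```c
#include <stdint.h>

/* Models vaddvq_u32: sum of the four lanes, wrapping modulo 2^32. */
static inline uint32_t add_across(const uint32_t lanes[4])
{
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Models vmaxvq_u32: unsigned maximum of the four lanes. */
static inline uint32_t max_across(const uint32_t lanes[4])
{
    uint32_t m = lanes[0];
    for (int i = 1; i < 4; ++i) {
        if (lanes[i] > m) {
            m = lanes[i];
        }
    }
    return m;
}
```

For the lanes `{0x80000000, 0x80000000, 0, 0}` the sum wraps to 0 while the maximum is `0x80000000`, so only the maximum is zero exactly when every lane is zero.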
Allan Sandfeld Jensen Carewolf answered Nov 12 '22 13:11


