I'm looking for the fastest way to test whether a 128-bit NEON register contains all zeros, using NEON intrinsics. I'm currently using 3 OR operations and 2 MOVs:
uint32x4_t vr = vorrq_u32(vcmp0, vcmp1);
uint64x2_t v0 = vreinterpretq_u64_u32(vr);
uint64x1_t v0or = vorr_u64(vget_high_u64(v0), vget_low_u64(v0));
uint32x2_t v1 = vreinterpret_u32_u64 (v0or);
uint32_t r = vget_lane_u32(v1, 0) | vget_lane_u32(v1, 1);
if (r == 0) { /* do stuff */ }
gcc translates this to the following assembly code:
VORR q9, q9, q10
VORR d16, d18, d19
VMOV.32 r3, d16[0]
VMOV.32 r2, d16[1]
ORRS r2, r2, r3
BEQ ...
Does anyone have an idea of a faster way?
While this answer may be a bit late, there is a simple way to do the test with only 3 instructions and no extra registers:
inline uint32_t is_not_zero(uint32x4_t v)
{
    uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
}
The return value will be nonzero if any bit in the 128-bit NEON register was set.
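As a rough usage sketch, the helper can be wrapped so that a loop tests a whole buffer 128 bits at a time. The name `all_zero_u32`, the requirement that the length be a multiple of 4, and the scalar fallback for builds without NEON are assumptions made for illustration; they are not part of the answer above.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* OR the two 64-bit halves, then fold the remaining two lanes with a pairwise max. */
static inline uint32_t is_not_zero(uint32x4_t v)
{
    uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
}
#endif

/* Hypothetical helper: returns 1 if all n words are zero.
   Assumes n is a multiple of 4. */
static int all_zero_u32(const uint32_t *p, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
#if defined(__ARM_NEON)
        if (is_not_zero(vld1q_u32(p + i)))
            return 0;
#else
        /* Scalar fallback so the sketch also builds without NEON. */
        if (p[i] | p[i + 1] | p[i + 2] | p[i + 3])
            return 0;
#endif
    }
    return 1;
}
```

Returning early on the first nonzero block keeps the NEON-to-core register transfer, which is the expensive part of the test, to one per 128 bits.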
If you're targeting AArch64 NEON, you can use the following to get a value to test with just two instructions:
inline uint64_t is_not_zero(uint32x4_t v)
{
    uint64x2_t v64 = vreinterpretq_u64_u32(v);
    // Saturating narrow: a nonzero 64-bit lane never narrows to zero.
    uint32x2_t v32 = vqmovn_u64(v64);
    uint64x1_t result = vreinterpret_u64_u32(v32);
    return vget_lane_u64(result, 0);
}
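The trick works because `vqmovn_u64` narrows with saturation: a 64-bit lane above `UINT32_MAX` clamps to `0xFFFFFFFF` instead of losing its upper bits, so a nonzero lane can never narrow to zero. Below is a minimal scalar model of that behavior in plain C, so it runs without NEON. The function names are made up for this sketch.

```c
#include <stdint.h>

/* Scalar model of one lane of vqmovn_u64: clamp to the 32-bit range. */
static uint32_t sat_narrow_u64(uint64_t x)
{
    return x > UINT32_MAX ? UINT32_MAX : (uint32_t)x;
}

/* Scalar model of one lane of a plain truncating narrow (vmovn_u64). */
static uint32_t trunc_narrow_u64(uint64_t x)
{
    return (uint32_t)x;
}

/* Model of the whole test: narrow both 64-bit halves, then view the two
   32-bit results as one 64-bit value (lane 0 in the low bits, as on a
   little-endian target). */
static uint64_t is_not_zero_model(uint64_t lo, uint64_t hi)
{
    return (uint64_t)sat_narrow_u64(lo) | ((uint64_t)sat_narrow_u64(hi) << 32);
}
```

With a truncating narrow, a lane such as `0x100000000` would become 0 and the test would miss it. The saturating form returns `0xFFFFFFFF` for that lane.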
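Since you are looking for intrinsics, here is one way (it relies on GCC/Clang vector extensions for the comparison and the casts):
inline bool is_zero(int32x4_t v) noexcept
{
    // Each nonzero lane becomes all ones; each zero lane stays zero.
    v = v != int32x4_t{};
    // Gather bytes 0, 4, 8 and 12 (one per lane) into the low 32 bits.
    return !int32x2_t(
        vtbl2_s8(
            int8x8x2_t{
                int8x8_t(vget_low_s32(v)),
                int8x8_t(vget_high_s32(v))
            },
            int8x8_t{0, 4, 8, 12}
        )
    )[0];
}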
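The idea behind the table lookup can be sketched as a scalar model: build a 16-byte mask in which every byte of a nonzero lane is `0xFF`, then gather bytes 0, 4, 8 and 12 (one per lane) into a single 32-bit word that is zero only when all four lanes are zero. The name `is_zero_model` and the array-based layout are assumptions for illustration. The code is plain C, so it runs without NEON.

```c
#include <stdint.h>

/* Scalar model of the compare-then-gather zero test. */
static int is_zero_model(const int32_t v[4])
{
    /* Per-lane mask: all ones for a nonzero lane, all zeros otherwise. */
    uint8_t mask[16];
    for (int lane = 0; lane < 4; ++lane)
        for (int b = 0; b < 4; ++b)
            mask[4 * lane + b] = v[lane] != 0 ? 0xFF : 0x00;

    /* Table lookup with indices 0, 4, 8, 12: one byte from each lane. */
    static const uint8_t idx[4] = {0, 4, 8, 12};
    uint32_t gathered = 0;
    for (int i = 0; i < 4; ++i)
        gathered |= (uint32_t)mask[idx[i]] << (8 * i);

    return gathered == 0;
}
```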
Nils Pipenbrinck's answer has a flaw: it assumes that QC, the cumulative saturation flag, is clear.
If you have AArch64, it is even easier: there is a new instruction designed for this.
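inline uint32_t is_not_zero(uint32x4_t v)
{
    // Horizontal maximum: zero only if every lane is zero.
    // (A horizontal add such as vaddvq_u32 can wrap around to zero.)
    return vmaxvq_u32(v);
}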
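For a zero test, the choice of across-lane reduction matters. A horizontal add (`vaddvq_u32`) wraps modulo 2^32, so two lanes of `0x80000000` sum to zero even though the register is nonzero. A horizontal maximum (`vmaxvq_u32`) is zero only when every lane is zero. The scalar models below, with the made-up names `hadd_u32` and `hmax_u32`, are a sketch of that difference and run without NEON.

```c
#include <stdint.h>

/* Scalar model of vaddvq_u32: sum of four lanes, wrapping modulo 2^32. */
static uint32_t hadd_u32(const uint32_t v[4])
{
    return v[0] + v[1] + v[2] + v[3];
}

/* Scalar model of vmaxvq_u32: largest of the four lanes. */
static uint32_t hmax_u32(const uint32_t v[4])
{
    uint32_t m = v[0];
    for (int i = 1; i < 4; ++i)
        if (v[i] > m)
            m = v[i];
    return m;
}
```

If only one bit per lane can ever be set, or the lanes are known to be small, the add form may still be adequate, but the maximum needs no such assumption.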