Summing 3 lanes in a NEON float32x4_t

Question

I'm vectorizing an inner loop with ARM NEON intrinsics (llvm, iOS). I'm generally using float32x4_ts. My computation finishes with the need to sum three of the four floats in this vector.

I can drop back to C floats at this point and use vst1q_f32 to get the four values out and add up the three I need. But I figure it may be more efficient to do it directly on the vector in an instruction or two and then grab a single-lane result. I couldn't figure out a clear path to doing this.

I'm new to NEON programming, and the existing "documentation" is pretty horrific. Any ideas? Thanks!

auselen · Accepted Answer

You should be able to use the VFP unit for this task. NEON and VFP share the same register bank, so you don't need to move data between registers to take advantage of either unit, and each unit can have a different view of the same register bits.

Your float32x4_t is 128 bits wide, so it must sit in a quad (Q) register. If you use only the ARM intrinsics, you don't know which one. The problem is that if it sits above Q7, VFP can't see it as single-precision values, because S0-S31 alias only Q0-Q7 (for the curious reader: I kept this simple since there are differences between VFP versions and this is the bare minimum requirement). So it would be best to move your float32x4_t to a fixed register such as Q0. After this you can sum S0, S1, and S2 with vadd.f32 and move the result back to an ARM register.

Some warnings: VFP and NEON are theoretically different execution units that share the same register bank and pipeline. I am not sure this approach is any better than the others; it goes without saying, but you should benchmark. This approach also can't be expressed with the NEON intrinsics, so you would probably need to write inline assembly.

I wrote a simple snippet to see what this could look like:

#include "arm_neon.h"

float32_t sum3() {           
        register float32x4_t v asm ("q0");
        float32_t ret;

        asm volatile(
        "vadd.f32       s0, s1\n"
        "vadd.f32       s0, s2\n"
        "vmov           %[ret], s0\n"
        : [ret] "=r" (ret)
        :
        :);

        return ret;
}

The objdump output looks like this (compiled with gcc -O3 -mfpu=neon -mfloat-abi=softfp):

00000000 <sum3>:
   0:   ee30 0a20   vadd.f32    s0, s0, s1
   4:   ee30 0a01   vadd.f32    s0, s0, s2
   8:   ee10 3a10   vmov    r0, s0
   c:   4770        bx  lr
   e:   bf00        nop

I really would like to hear your impressions if you give this a go!

Jesse Rusak · Answer

Can you zero out the fourth element? Perhaps just by copying the vector and using vsetq_lane_f32?

If so, you can use the answers from "Sum all elements in a quadword vector in ARM assembly with NEON", like this:

float32x2_t r = vadd_f32(vget_high_f32(input), vget_low_f32(input));
return vget_lane_f32(vpadd_f32(r, r), 0); // vpadd adds adjacent elements

This actually does a bit more work than you need, though, so it might be faster to just extract the three floats with vgetq_lane_f32 and add them.
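
For reference, here is a minimal sketch of both options described above, written as complete functions. The function names (sum3_zero_lane, sum3_extract) and the HAVE_NEON macro are made up for illustration, and the scalar fallback is only there so the file also builds on a non-ARM host. Which variant is faster is not something I can claim; it would need to be measured on the target.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0   /* scalar fallback so the sketch builds off-target */
#endif

/* Hypothetical helper: p points at 4 floats; returns p[0] + p[1] + p[2].
   Zeroes lane 3, then reduces all four lanes. */
float sum3_zero_lane(const float *p) {
#if HAVE_NEON
    float32x4_t v = vld1q_f32(p);
    v = vsetq_lane_f32(0.0f, v, 3);                 /* {a, b, c, 0}   */
    float32x2_t r = vadd_f32(vget_high_f32(v),
                             vget_low_f32(v));      /* {c+a, 0+b}     */
    return vget_lane_f32(vpadd_f32(r, r), 0);       /* (c+a) + b      */
#else
    return (p[2] + p[0]) + p[1];
#endif
}

/* Hypothetical helper: extract the three lanes and add them as scalars. */
float sum3_extract(const float *p) {
#if HAVE_NEON
    float32x4_t v = vld1q_f32(p);
    return vgetq_lane_f32(v, 0) + vgetq_lane_f32(v, 1) + vgetq_lane_f32(v, 2);
#else
    return p[0] + p[1] + p[2];
#endif
}
```

Note that the two functions add the lanes in a different order, so for general inputs the results can differ in the last bit due to floating-point rounding.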

rob mayoff · Answer

It sounds like you want to use (some version of) VLD1 to load zero into your extra lane (unless you can arrange for it to be zero already), followed by two VPADD instructions (VPADDL is the integer widening form and does not take floats) to pairwise-sum four lanes into two and then two lanes into one.
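
A rough intrinsics sketch of that idea, assuming the float pairwise add (vpadd_f32) is what is wanted here. The function name sum3_vld1_lane and the HAVE_NEON macro are hypothetical, and the scalar branch exists only so the file compiles on a non-ARM host. vld1q_lane_f32 is the intrinsic form of a single-lane VLD1.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0   /* scalar fallback so the sketch builds off-target */
#endif

/* Hypothetical helper: p points at 4 floats; returns p[0] + p[1] + p[2]. */
float sum3_vld1_lane(const float *p) {
#if HAVE_NEON
    const float zero = 0.0f;
    float32x4_t v = vld1q_f32(p);
    v = vld1q_lane_f32(&zero, v, 3);                 /* load 0 into lane 3 */
    float32x2_t s = vpadd_f32(vget_low_f32(v),
                              vget_high_f32(v));     /* {a+b, c+0}         */
    s = vpadd_f32(s, s);                             /* {(a+b)+c, (a+b)+c} */
    return vget_lane_f32(s, 0);
#else
    return (p[0] + p[1]) + p[2];
#endif
}
```

If the data can be laid out so the fourth lane is already zero, the vld1q_lane_f32 step can be dropped and only the two pairwise adds remain.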

Tags: ios, simd, arm, intrinsics, neon

Ben Zotto
