I'm vectorizing an inner loop with ARM NEON intrinsics (LLVM, iOS). I'm generally using float32x4_t vectors. My computation finishes with the need to sum three of the four floats in this vector.
I can drop back to C floats at this point and use vst1q_f32 to get the four values out and add up the three I need. But I figure it may be more efficient if there's a way to do it directly on the vector in an instruction or two and then just grab a single-lane result. I couldn't find any clear path to doing this.
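For reference, the store-and-add fallback described above might look roughly like the sketch below. The names sum3_store and sum3_from_memory are made up for illustration, and the preprocessor guard with a scalar fallback is only there so the snippet also builds on a machine without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Baseline: spill all four lanes to memory, then add three as plain floats. */
static inline float sum3_store(float32x4_t v) {
    float lanes[4];
    vst1q_f32(lanes, v);               /* store lanes 0..3 */
    return lanes[0] + lanes[1] + lanes[2];
}
#endif

/* Hypothetical wrapper: loads four floats and sums the first three.
   Falls back to plain C when NEON is not available. */
float sum3_from_memory(const float p[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return sum3_store(vld1q_f32(p));
#else
    return p[0] + p[1] + p[2];
#endif
}
```

This is the version any vector-only approach would have to beat in a benchmark.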
I'm new to NEON programming, and the existing "documentation" is pretty horrific. Any ideas? Thanks!
You should be able to use the VFP unit for this task. NEON and VFP share the same register bank, so you don't need to shuffle data between registers to take advantage of either unit, and each can have its own view of the same register bits.
Your float32x4_t is 128 bits wide, so it must sit in a quad (Q) register. If you are using only the ARM intrinsics, you won't know which one. The problem is that if it sits in Q8 or above, VFP can't see it as single precision, because the single-precision registers S0 to S31 only overlap Q0 to Q7 (for the curious reader: I kept this simple since there are differences between VFP versions, and this is the bare minimum requirement). So it would be best to move your float32x4_t to a fixed register like Q0. After this you can just sum registers S0, S1 and S2 with vadd.f32 and move the result back to an ARM register.
Some warnings: VFP and NEON are theoretically different execution units sharing the same register bank and pipeline. I am not sure this approach is any better than the others, so, as always, you should benchmark. Also, this approach doesn't fit neatly with the NEON intrinsics, so you would probably need to write this part with inline assembly.
I wrote a simple snippet to see what this could look like and came up with this:
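#include "arm_neon.h"

/* Sketch only: assumes the vector to reduce is already live in q0,
   so that its lanes alias s0..s3. 32-bit ARM (VFP/NEON) only. */
float32_t sum3() {
    register float32x4_t v asm ("q0");
    float32_t ret;
    asm volatile(
        "vadd.f32 s0, s0, s1\n"   /* s0 = lane 0 + lane 1 */
        "vadd.f32 s0, s0, s2\n"   /* s0 = s0 + lane 2 */
        "vmov %[ret], s0\n"       /* copy the sum to a core register */
        : [ret] "=r" (ret)
        :
        : "s0");                  /* s0 is overwritten by the adds */
    return ret;
}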
The objdump of it looks like this (compiled with gcc -O3 -mfpu=neon -mfloat-abi=softfp):
00000000 <sum3>:
0: ee30 0a20 vadd.f32 s0, s0, s1
4: ee30 0a01 vadd.f32 s0, s0, s2
8: ee10 3a10 vmov r0, s0
c: 4770 bx lr
e: bf00 nop
I really would like to hear your impressions if you give this a go!
Can you zero out the fourth element? Perhaps just by copying the vector and using vsetq_lane_f32?
If so, you can use the answers from "Sum all elements in a quadword vector in ARM assembly with NEON", like:
float32x2_t r = vadd_f32(vget_high_f32(input), vget_low_f32(input));
return vget_lane_f32(vpadd_f32(r, r), 0); // vpadd adds adjacent elements
Though this actually does a bit more work than you need, so it might be faster to just extract the three floats with vgetq_lane_f32 and add them.
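Putting both options side by side, a minimal sketch could look like this. The function names are hypothetical, and the guard plus scalar fallback in the wrappers is an addition so the snippet builds on hosts without NEON. Note that the two variants add the lanes in a different order, so the results can differ in the last bit for general inputs.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Option 1: zero lane 3, then fold the vector in two steps. */
static inline float sum3_pairwise(float32x4_t v) {
    v = vsetq_lane_f32(0.0f, v, 3);                               /* lane 3 = 0 */
    float32x2_t r = vadd_f32(vget_high_f32(v), vget_low_f32(v));  /* (l2+l0, 0+l1) */
    return vget_lane_f32(vpadd_f32(r, r), 0);                     /* l0+l2+l1 */
}

/* Option 2: pull the three lanes out and add them as scalars. */
static inline float sum3_lanes(float32x4_t v) {
    return vgetq_lane_f32(v, 0) + vgetq_lane_f32(v, 1) + vgetq_lane_f32(v, 2);
}
#endif

/* Hypothetical wrappers taking four floats from memory. */
float sum3_pairwise_mem(const float p[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return sum3_pairwise(vld1q_f32(p));
#else
    return p[0] + p[2] + p[1];   /* scalar fallback, same order as option 1 */
#endif
}

float sum3_lanes_mem(const float p[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return sum3_lanes(vld1q_f32(p));
#else
    return p[0] + p[1] + p[2];   /* scalar fallback */
#endif
}
```

Which one wins probably depends on the core and on what the compiler does with the lane extracts, so it is worth timing both.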
It sounds like you want to use (some version of) VLD1 to load zero into your extra lane (unless you can arrange for it to be zero already), followed by two VPADD.F32 instructions to pairwise-sum four lanes into two and then two lanes into one.
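With intrinsics, that idea might be sketched as follows, using vld1q_lane_f32 for the single-lane load and vpadd_f32 for the pairwise adds. The names sum3_vpadd and sum3_vpadd_mem are invented here, and the non-NEON fallback is only there to keep the snippet buildable elsewhere. Whether the compiler really emits a single-lane VLD1 for the zeroing step is not guaranteed.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Load 0.0f into lane 3, then reduce with two pairwise adds. */
static inline float sum3_vpadd(float32x4_t v) {
    const float zero = 0.0f;
    v = vld1q_lane_f32(&zero, v, 3);                               /* lane 3 = 0 */
    float32x2_t s = vpadd_f32(vget_low_f32(v), vget_high_f32(v));  /* (l0+l1, l2+0) */
    s = vpadd_f32(s, s);                                           /* (l0+l1+l2, same) */
    return vget_lane_f32(s, 0);
}
#endif

/* Hypothetical wrapper taking four floats from memory. */
float sum3_vpadd_mem(const float p[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return sum3_vpadd(vld1q_f32(p));
#else
    return (p[0] + p[1]) + p[2];   /* scalar fallback, same order */
#endif
}
```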