Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Summing 3 lanes in a NEON float32x4_t

I'm vectorizing an inner loop with ARM NEON intrinsics (llvm, iOS). I'm generally using float32x4_ts. My computation finishes with the need to sum three of the four floats in this vector.

I can drop back to C floats at this point and vst1q_f32 to get the four values out and add up the three I need. But I figure it may be more effective if there's a way to do it directly with the vector in an instruction or two, and then just grab a single lane result, but I couldn't figure out any clear path to doing this.

I'm new to NEON programming, and the existing "documentation" is pretty horrific. Any ideas? Thanks!

like image 575
Ben Zotto Avatar asked Dec 14 '12 00:12

Ben Zotto


3 Answers

You should be able to use VFP unit for such task. NEON and VFP shares the same register bank, meaning you don't need to shuffle around registers to get advantage of one unit and they can also have different views of the same register bits.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/ch05s03s02.html

Your float32x4_t is 128 bit so it must sit on a Quad (Q) register. If you are solely using arm intrinsic you wouldn't know which one you are using. Problem there is if it is sitting above 4, VFP can't see it as a single precision (for the curious reader: I kept this simple since there are differences between VFP versions and this is the bare minimum requirement.). So it would be best to move your float32x4_t to a fixed register like Q0. After this you can just sum registers like S0, S1, S2 with vadd.f32 and move the result back to an ARM register.

Some warnings... VFP and NEON are theoretically different execution units sharing same register bank and pipeline. I am not sure if this approach is any better than others, I don't need to say but again, you should do benchmark. Also this approach isn't streamlined with neon intrinsic so you probably would need to craft your code with inline assembly.

I did a simple snippet to see how this can look like and I've come up with this:

#include "arm_neon.h"

float32_t sum3() {           
        register float32x4_t v asm ("q0");
        float32_t ret;

        asm volatile(
        "vadd.f32       s0, s1\n"
        "vadd.f32       s0, s2\n"
        "vmov           %[ret], s0\n"
        : [ret] "=r" (ret)
        :
        :);

        return ret;
}

objdump of it looks like (compiled with gcc -O3 -mfpu=neon -mfloat-abi=softfp)

00000000 <sum3>:
   0:   ee30 0a20   vadd.f32    s0, s0, s1
   4:   ee30 0a01   vadd.f32    s0, s0, s2
   8:   ee10 3a10   vmov    r0, s0
   c:   4770        bx  lr
   e:   bf00        nop

I really would like to hear your impressions if you give this a go!

like image 192
auselen Avatar answered Nov 09 '22 05:11

auselen


Can you zero-out the fourth element? Perhaps just by copying it and using vset_lane_f32?

If so, you can use the answers from Sum all elements in a quadword vector in ARM assembly with NEON like:

float32x2_t r = vadd_f32(vget_high_f32(input), vget_low_f32(input));
return vget_lane_f32(vpadd_f32(r, r), 0); // vpadd adds adjacent elements

Though this actually does a bit more work than you need, so it might be faster to just extract the three floats with vget_lane_f32 and add them.

like image 33
Jesse Rusak Avatar answered Nov 09 '22 04:11

Jesse Rusak


It sounds like you want to use (some version of) VLD1 to load zero into your extra lane (unless you can arrange for it to be zero already), followed by two VPADDL instructions to pairwise-sum four lanes into two and then two lanes into one.

like image 25
rob mayoff Avatar answered Nov 09 '22 03:11

rob mayoff