NEON ASM code running much slower than C code?





I'm trying to implement Gauss-Newton optimization for a specific problem on iPhone ARM using NEON. The first function below is my original C function. The second is the NEON asm code I wrote. I ran each one 100,000 times, and the NEON version takes 7-8 times longer than the C version. I think the loading (vld1.32) is what takes most of the time, based on experiments in which I removed some instructions.

Does anyone have any insight into this problem? Thanks!

template<class T>
inline void GaussNewtonOperationJtr8x8(T Jtr[8], const T J[8], T residual)
{
    Jtr[0] -= J[0]*residual;
    Jtr[1] -= J[1]*residual;
    Jtr[2] -= J[2]*residual;
    Jtr[3] -= J[3]*residual;
    Jtr[4] -= J[4]*residual;
    Jtr[5] -= J[5]*residual;
    Jtr[6] -= J[6]*residual;
    Jtr[7] -= J[7]*residual;
}

inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
    __asm__ volatile (
                      // load Jtr into registers
                      "vld1.32   {d0-d3}, [%0]\n\t"
                      // load J into registers
                      "vld1.32   {d4-d7}, [%1]\n\t"
                      // load residual in register
                      "vmov.f32  s16, %2\n\t"
                      // Jtr -= J*residual
                      "vmls.f32  q0, q2, d8[0]\n\t"
                      "vmls.f32  q1, q3, d8[0]\n\t"
                      // store result
                      "vst1.32   {d0-d3}, [%0]\n\t"
                      // output
                      :
                      // input
                      : "r"(Jtr), "r"(J), "r"(residual)
                      // registers
                      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14"
                      );
}

  1. Don't use d8-d15. They are callee-saved, so they have to be pushed onto the stack before use and restored afterwards. The compiler inserts the instructions that do this, wasting valuable cycles.
  2. Load J before Jtr. Jtr is needed at a later pipeline stage than J.
  3. Use VLDMIA/VSTMIA instead of VLD/VST. VLDMIA/VSTMIA is faster and has an advantage pipeline-wise.
  4. Use vector-vector multiplication instead of vector-scalar multiplication.
  5. If you create a looped version, put pld at the beginning and unroll the loop so that 64 bytes are read from each pointer per iteration.

Apart from the faults I mentioned above, which are typical for people new to NEON, your approach is very nice. You found the most appropriate instruction in vmls.

Well done.


__asm__ volatile (
    // load residual in register
    "vdup.32  q12, %2\n\t"
    // load J into registers
    "vldmia   %1, {q10-q11}\n\t"
    // load Jtr into registers
    "vldmia   %0, {q8-q9}\n\t"
    // Jtr -= J*residual
    "vmls.f32  q8, q10, q12\n\t"
    "vmls.f32  q9, q11, q12\n\t"
    // store result
    "vstmia   %0, {q8-q9}\n\t"
    // output
    :
    // input
    : "r"(Jtr), "r"(J), "r"(residual)
    // registers
    : "q8", "q9", "q10", "q11", "q12"
    );
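
The looped version from point 5 could look roughly like the sketch below. It is only an illustration under several assumptions that are not in the original post: it uses NEON intrinsics instead of inline asm (so the compiler picks the registers), the function name GaussNewtonOperationJtrLoop and the count parameter are hypothetical, count is assumed to be a multiple of 16 floats (64 bytes), and __builtin_prefetch (a GCC/Clang builtin) stands in for pld. A plain scalar loop is used when NEON is not available.

```cpp
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper, not from the original post.
// Computes Jtr[i] -= J[i] * residual for i in [0, count).
// Assumption: count is a multiple of 16 floats (64 bytes per pointer).
inline void GaussNewtonOperationJtrLoop(float* Jtr, const float* J,
                                        float residual, std::size_t count)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Broadcast the residual once so the multiply is vector-vector (point 4).
    const float32x4_t r = vdupq_n_f32(residual);

    for (std::size_t i = 0; i < count; i += 16) {
        // Hint the next 64-byte block of each stream (point 5).
        // A prefetch hint does not fault, even at the end of the arrays.
        __builtin_prefetch(J + i + 16);
        __builtin_prefetch(Jtr + i + 16);

        // Load J before Jtr (point 2).
        const float32x4_t j0 = vld1q_f32(J + i);
        const float32x4_t j1 = vld1q_f32(J + i + 4);
        const float32x4_t j2 = vld1q_f32(J + i + 8);
        const float32x4_t j3 = vld1q_f32(J + i + 12);

        const float32x4_t t0 = vld1q_f32(Jtr + i);
        const float32x4_t t1 = vld1q_f32(Jtr + i + 4);
        const float32x4_t t2 = vld1q_f32(Jtr + i + 8);
        const float32x4_t t3 = vld1q_f32(Jtr + i + 12);

        // Jtr -= J * residual, four lanes at a time.
        vst1q_f32(Jtr + i,      vmlsq_f32(t0, j0, r));
        vst1q_f32(Jtr + i + 4,  vmlsq_f32(t1, j1, r));
        vst1q_f32(Jtr + i + 8,  vmlsq_f32(t2, j2, r));
        vst1q_f32(Jtr + i + 12, vmlsq_f32(t3, j3, r));
    }
#else
    // Scalar fallback for targets without NEON.
    for (std::size_t i = 0; i < count; ++i) {
        Jtr[i] -= J[i] * residual;
    }
#endif
}
```

Calling it with two 16-float arrays and count = 16 runs a single unrolled iteration. Whether the prefetch and the unrolling actually help depends on the core, so it is worth timing against the plain C loop.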
