I'm trying to implement Gauss-Newton optimization for a specific problem on iPhone ARM using NEON. The first function below is my original C function. The second is the NEON asm code I wrote. I ran each one 100,000 times, and the NEON version takes 7-8 times longer than the C version. I think the loading (vld1.32) is what takes most of the time. I experimented by removing some instructions.
Does anyone have any insight into this problem? Thanks!
template<class T>
inline void GaussNewtonOperationJtr8x8(T Jtr[8], const T J[8], T residual)
{
Jtr[0] -= J[0]*residual;
Jtr[1] -= J[1]*residual;
Jtr[2] -= J[2]*residual;
Jtr[3] -= J[3]*residual;
Jtr[4] -= J[4]*residual;
Jtr[5] -= J[5]*residual;
Jtr[6] -= J[6]*residual;
Jtr[7] -= J[7]*residual;
}
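For context, this routine is called once per residual to accumulate the Gauss-Newton right-hand side J^T r (with a negative sign here, since the update subtracts). A hypothetical usage loop looks like this; the names J_rows, residuals, and numMeasurements are made up for illustration:

// Hypothetical accumulation loop: J_rows[i] holds the Jacobian row and
// residuals[i] the residual of measurement i; Jtr accumulates -J^T r.
float Jtr[8] = {0};
for (int i = 0; i < numMeasurements; ++i)
    GaussNewtonOperationJtr8x8(Jtr, J_rows[i], residuals[i]);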
inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
__asm__ volatile (
// load Jtr into registers
"vld1.32 {d0-d3}, [%0]\n\t"
// load J into registers
"vld1.32 {d4-d7}, [%1]\n\t"
// load residual in register
"vmov.f32 s16, %2\n\t"
// Jtr -= J*residual
"vmls.f32 q0, q2, d8[0]\n\t"
"vmls.f32 q1, q3, d8[0]\n\t"
// store result
"vst1.32 {d0-d3}, [%0]\n\t"
// output
:
// input
: "r"(Jtr), "r"(J), "r"(residual)
// registers
: "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14"
);
}
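The timing comparison can be reproduced with a loop along these lines. This is a simplified sketch, not the exact harness I used; it assumes NFloat is float and that both functions above are in scope:

#include <cstdio>
#include <ctime>

int main()
{
    float Jtr[8] = {0};
    float J[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    const float residual = 0.5f;
    const int iterations = 100000;

    // Time the plain C version.
    clock_t t0 = clock();
    for (int i = 0; i < iterations; ++i)
        GaussNewtonOperationJtr8x8(Jtr, J, residual);
    clock_t t1 = clock();

    // Time the NEON inline-asm version on the same data.
    for (int i = 0; i < iterations; ++i)
        GaussNewtonOperationJtr8x8_NEON(Jtr, J, residual);
    clock_t t2 = clock();

    printf("C:    %ld ticks\n", (long)(t1 - t0));
    printf("NEON: %ld ticks\n", (long)(t2 - t1));
    printf("Jtr[0] = %f\n", Jtr[0]); // keep the result live so neither loop is optimized away
    return 0;
}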
Besides those faults mentioned above - which are typical for people new to NEON - your approach is very nice. You found the most appropriate instruction in vmls.
Well done.
inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
__asm__ volatile (
// load residual in register
"vdup.32 q12, %2\n\t"
// load J into registers
"vldmia %1, {q10-q11}\n\t"
// load Jtr into registers
"vldmia %0, {q8-q9}\n\t"
// Jtr -= J*residual
"vmls.f32 q8, q10, q12\n\t"
"vmls.f32 q9, q11, q12\n\t"
// store result
"vstmia %0, {q8-q9}\n\t"
// output
:
// input
: "r"(Jtr), "r"(J), "r"(residual)
// registers
: "q8", "q9", "q10", "q11", "q12"
);
}
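For completeness, the same update can also be written with NEON intrinsics from arm_neon.h, which lets the compiler handle register allocation and instruction scheduling instead of hand-written inline asm. This is a sketch, assuming NFloat is float:

#include <arm_neon.h>

inline void GaussNewtonOperationJtr8x8_Intrinsics(float Jtr[8], const float J[8], float residual)
{
    float32x4_t r  = vdupq_n_f32(residual);   // broadcast residual to all 4 lanes
    float32x4_t j0 = vld1q_f32(J);            // J[0..3]
    float32x4_t j1 = vld1q_f32(J + 4);        // J[4..7]
    float32x4_t a0 = vld1q_f32(Jtr);          // Jtr[0..3]
    float32x4_t a1 = vld1q_f32(Jtr + 4);      // Jtr[4..7]
    a0 = vmlsq_f32(a0, j0, r);                // a0 = a0 - j0 * r
    a1 = vmlsq_f32(a1, j1, r);                // a1 = a1 - j1 * r
    vst1q_f32(Jtr, a0);
    vst1q_f32(Jtr + 4, a1);
}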