Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sum all elements in a quadword vector in ARM assembly with NEON

Im rather new to assembly and although the arm information center is often helpful sometimes the instructions can be a little confusing to a newbie. Basically what I need to do is sum 4 float values in a quadword register and store the result in a single precision register. I think the instruction VPADD can do what I need but I'm not quite sure.

like image 732
A Person Avatar asked Aug 03 '11 18:08

A Person


2 Answers

You might try this (it's not in ASM, but you should be able to convert it easily):

float32x2_t r = vadd_f32(vget_high_f32(m_type), vget_low_f32(m_type));
return vget_lane_f32(vpadd_f32(r, r), 0);

In ASM it would be probably only VADD and VPADD.

I'm not sure if this is only one method to do this (and most optimal), but I haven't figured/found better one...

PS. I'm new to NEON too

like image 123
Krystian Bigaj Avatar answered Oct 18 '22 16:10

Krystian Bigaj


It seems that you want to get the sum of a certain length of array, and not only four float values.

In that case, your code will work, but is far from optimized :

  1. many many pipeline interlocks

  2. unnecessary 32bit addition per iteration

Assuming the length of the array is a multiple of 8 and at least 16 :

  vldmia {q0-q1}, [pSrc]!
  sub count, count, #8
loop:
  pld [pSrc, #32]
  vldmia {q3-q4}, [pSrc]!
  subs count, count, #8
  vadd.f32 q0, q0, q3
  vadd.f32 q1, q1, q4
  bgt loop

  vadd.f32 q0, q0, q1
  vpadd.f32 d0, d0, d1
  vadd.f32 s0, s0, s1
  • pld - while being an ARM instruction and not NEON - is crucial for performance. It drastically increases cache hit rate.

I hope the rest of the code above is self explanatory.

You will notice that this version is many times faster than your initial one.

like image 32
Jake 'Alquimista' LEE Avatar answered Oct 18 '22 16:10

Jake 'Alquimista' LEE