
Is NEON of ARM faster for integers than floating points?

Tags: c, arm, neon

Or do floating-point and integer operations run at the same speed? If not, how much faster is the integer version?

MetallicPriest asked May 31 '13 10:05



3 Answers

ARM publishes instruction-specific scheduling information for Advanced SIMD instructions on the Cortex-A8 (they don't publish it for newer cores, since instruction timing has become much more complicated).

See the "Advanced SIMD integer ALU instructions" and "Advanced SIMD floating-point instructions" sections to compare the two.

You may also need to read the explanation of how to read those tables.

To give a complete answer: in general, floating-point instructions take two cycles, while instructions executed on the integer ALU take one cycle. On the other hand, multiplication of long long (8-byte integer) takes four cycles (from the same source), while multiplication of double takes two cycles.

In general, it seems the choice between float and integer matters less than a careful choice of data type (float vs double, int vs long long).

auselen answered Nov 15 '22 15:11



It depends on which model you have, but the tendency has been for integer to have more opportunities to use the 128-bit wide data paths. This is no longer true on newer CPUs.

Of course, integer arithmetic also gives you the opportunity to increase the parallelism by using 16-bit or 8-bit operations.
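As a rough illustration of that lane arithmetic, assuming a 128-bit NEON Q register: narrower elements mean more lanes per instruction. The helper name `lanes_per_q_register` is hypothetical and only does the division; it does not touch NEON itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Width of a NEON quadword (Q) register in bits. */
#define NEON_Q_REGISTER_BITS 128u

/* Hypothetical helper: how many elements of the given size
 * fit in one Q register. */
unsigned lanes_per_q_register(size_t element_bytes)
{
    assert(element_bytes > 0 && element_bytes <= 16);
    return (unsigned)(NEON_Q_REGISTER_BITS / (8u * element_bytes));
}
```

So an 8-bit kernel can, in principle, process four times as many elements per instruction as a 32-bit float kernel, provided the data actually fits in 8 bits.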

As with all integer-versus-floating-point arguments, it depends on the specific problem and how much time you're willing to invest in tuning, because they can rarely run exactly the same code.

sh1 answered Nov 15 '22 16:11



I would refer to auselen's answer for great links to all of the references. However, I found the actual cycle counts a little misleading. It is true that it can "go either way" depending on the precision that you need, but let's say that you have some parallelism in your routine and can efficiently operate on two words (single-precision floats) at a time. Let's also assume that you need the amount of precision for which floating point may be a good idea: 24 bits.

In particular, when analyzing NEON performance, remember that there is a write-back delay (pipeline delay), so you have to wait for a result to become ready if that result is required as the input to another instruction.

For fixed point you will need 32 bit ints to represent at least 24 bits of precision:

  • Multiply two-by-two 32 bit numbers together, and get a 64 bit result. This takes two cycles and requires an extra register to store the wide result.
  • Shift the 64 bit numbers back to 32 bit numbers of the desired precision. This takes one cycle, but you first have to wait out the write-back delay (5-6 cycles) from the multiply.

For floating point:

  • Multiply two-by-two 32 bit floats together. This takes one cycle.
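The two paths compared above can be sketched in portable scalar C. This is only a model of the data flow, not NEON code and not a timing measurement. The Q8.24 format, `FRAC_BITS`, and the function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed format: Q8.24, i.e. 24 fractional bits in a 32 bit int. */
#define FRAC_BITS 24

/* Convert a value in [-128, 128) to Q8.24 (truncating). */
int32_t fixed_from_double(double x)
{
    assert(x >= -128.0 && x < 128.0);
    return (int32_t)(x * (double)(1 << FRAC_BITS));
}

/* Fixed point: two dependent steps. */
int32_t fixed_mul(int32_t a, int32_t b)
{
    /* Step 1: widening multiply, 32 x 32 -> 64 bit result. */
    int64_t wide = (int64_t)a * (int64_t)b;
    /* Step 2: shift back to Q8.24. This needs the result of step 1,
     * which is where the write-back delay would be paid on NEON.
     * Right-shifting a negative value is implementation-defined in C;
     * an arithmetic shift is assumed here. Overflow is not handled. */
    return (int32_t)(wide >> FRAC_BITS);
}

/* Floating point: a single multiply, no renormalizing shift. */
float float_mul(float a, float b)
{
    return a * b;
}
```

For example, `fixed_mul(fixed_from_double(1.5), fixed_from_double(2.0))` gives the Q8.24 encoding of 3.0, at the cost of one more dependent operation than `float_mul(1.5f, 2.0f)`.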

So for this scenario, there is no way in heck that you would ever choose integer over floating point.

If you are dealing with 16 bit data, then the tradeoffs are much closer, although you may still need an extra instruction to shift the result of the multiply back to the desired precision. If you are using Q15, you can use the VQDMULH instruction on s16 data and achieve much higher performance with fewer registers than single-precision float.
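A minimal sketch of what VQDMULH does per s16 lane, with an optional NEON path. The function names `q15_mulh` and `q15_mul_array` are hypothetical. The intrinsics `vld1q_s16`, `vst1q_s16` and `vqdmulhq_s16` come from `arm_neon.h` and are compiled only when `__ARM_NEON` is defined, and the scalar fallback assumes an arithmetic right shift for negative values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar model of one VQDMULH.S16 lane: double the product, keep the
 * high 16 bits, saturate. Equivalent to (a * b) >> 15, truncating. */
int16_t q15_mulh(int16_t a, int16_t b)
{
    /* The only overflowing case: (-1.0) * (-1.0) in Q15. */
    if (a == INT16_MIN && b == INT16_MIN)
        return INT16_MAX;
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)(product >> 15);
}

/* Multiply two Q15 arrays element-wise; the multiply and the shift
 * back to Q15 happen in a single instruction on the NEON path. */
void q15_mul_array(const int16_t *a, const int16_t *b,
                   int16_t *out, size_t n)
{
    assert(n == 0 || (a && b && out));
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 8 <= n; i += 8) {          /* 8 lanes of s16 per Q register */
        int16x8_t va = vld1q_s16(a + i);
        int16x8_t vb = vld1q_s16(b + i);
        vst1q_s16(out + i, vqdmulhq_s16(va, vb));
    }
#endif
    for (; i < n; ++i)                    /* tail, or whole array without NEON */
        out[i] = q15_mulh(a[i], b[i]);
}
```

Here 0.5 in Q15 is 16384, so `q15_mulh(16384, 16384)` yields 8192, which is 0.25 in Q15. If rounding rather than truncation is wanted, VQRDMULH is the rounding variant of the same instruction.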

Also, as auselen mentions, newer cores have different micro-architectures, and things always change. We are lucky that ARM actually makes their info public. For vendors that modify the microarchitecture, like Apple, Qualcomm and Samsung (probably others...), the only way to know is to try it, which can be a lot of work if you are writing assembly. Still, I think the official ARM instruction timing website is probably quite useful. I actually do think that they publish the numbers for A9, and that these are mostly identical.

Peter M answered Nov 15 '22 14:11
