Problems with Qualcomm Scorpion dual-core ARM NEON code?

I am developing a native library for Android where I use ARM assembly optimizations and multithreading in order to get maximum performance on the dual-core ARM chipset MSM8660. While doing some measurements I noticed the following:

  1. The single-threaded library with NEON optimizations is faster than the single-threaded library with ARMv6 optimizations (as expected).
  2. The multi-threaded library with ARMv6 optimizations is faster than the single-threaded library with ARMv6 optimizations (as expected).
  3. The multi-threaded library with NEON optimizations is slower than the single-threaded library with NEON optimizations (definitely not expected!).

I have searched all over the net for an explanation but have not found one so far. It almost seems as if the cores share a single NEON pipeline, yet every schematic I have seen indicates that each core has its own NEON unit. Does anyone know why this is happening?

Leo asked Sep 29 '11 11:09



1 Answer

First of all, what library are you using?

You're correct, each core has its own NEON unit. It is, however, Qualcomm's own proprietary 'VeNum' unit, and not much information is published about it. It was designed for the Scorpion core (Qualcomm's own ARMv7 design, comparable to the Cortex-A8) in the 8x50 and was considerably better than ARM's own implementation of NEON SIMD. The good news is that Qualcomm designs its hardware to be compatible with the base reference design, so most code written for a Cortex-A8 will work just fine on Scorpion, albeit with some performance hit because instruction timings may differ.

If you're compiling your program with "softfp", you will have an overhead of approximately 20 cycles for every function call that takes floating-point arguments and/or uses the NEON unit. Transferring register data from the ARM core to the NEON unit and back is quite slow, and it can sometimes stall the core for many cycles while it waits for the pipeline to flush.
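One way to reduce that per-call cost is to move the function boundary so that floats cross it as rarely as possible. The following is only a minimal sketch of the idea, not code from any particular library: `scale_one` and `scale_batch` are hypothetical names, and the roughly 20-cycle figure above is not measured here.

```c
#include <assert.h>
#include <stddef.h>

/* Per-element API (hypothetical): with -mfloat-abi=softfp both `x` and `k`
 * are passed in core (integer) registers, so every single call has to move
 * them into the VFP/NEON register file and move the result back. */
float scale_one(float x, float k)
{
    return x * k;
}

/* Batch API (hypothetical): only two pointers, a count and one float cross
 * the call boundary, so the core<->NEON transfer is paid once per batch
 * instead of once per element. The loop body is plain C here; it is the
 * part one would replace with NEON assembly or intrinsics. */
void scale_batch(float *dst, const float *src, size_t n, float k)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```

Calling `scale_batch(out, in, n, 2.0f)` once over a whole buffer replaces `n` calls to `scale_one`. Whether this recovers the lost time on Scorpion is something to confirm by measurement.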

Also, for a threaded program that uses the floating-point unit, the kernel has to save the FP registers on every context switch. That incurs an additional penalty for threads, since, as noted above, moving registers from NEON to ARM is slow and is known to stall the pipeline.

Additionally, many other factors can contribute to this, such as poor compiler optimization, cache misses, not using Scorpion's dual-issue capability, bad instruction scheduling, and your threads repeatedly being switched from one core to the other.
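To rule out the last factor, you could try pinning each worker thread to one core. Below is a minimal sketch using the Linux `sched_setaffinity` call; the helper names `first_allowed_cpu` and `pin_current_thread_to_cpu` are hypothetical, and I am assuming your NDK headers expose the `CPU_SET` macros (if they do not, the mask has to be built by hand).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes CPU_ZERO/CPU_SET and sched_setaffinity */
#endif
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread is currently allowed
 * to run on, or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Pin the calling thread to `cpu` so the scheduler stops migrating it.
 * Returns 0 on success, -1 on error (errno is set by the kernel). */
int pin_current_thread_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Each worker would call `pin_current_thread_to_cpu` once at the start of its thread function, for example worker 0 on CPU 0 and worker 1 on CPU 1. That layout is an assumption about your setup, and the call fails if the requested core is offline or not permitted, so check the return value and compare timings with and without pinning.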

sgupta answered Oct 13 '22 10:10
