
How to improve the poor performances of OpenMP on Android?

I wrote an image processing app for Android (https://play.google.com/store/apps/details?id=cv.cvExperiments) with some C++ code wrapped with JNI. To get a speedup on multicore processors, I annotated the expensive loops with OpenMP "parallel for" directives.
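For readers unfamiliar with the directive, a loop annotated this way might look like the minimal sketch below. The function name and the inversion operation are illustrative assumptions, not code from the app.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical example: invert an 8-bit grayscale image stored row by row.
// Rows are independent, so the outer loop can be shared between threads.
void invert_image(std::vector<std::uint8_t>& pixels, int width, int height) {
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::uint8_t& p = pixels[static_cast<std::size_t>(y) * width + x];
            p = static_cast<std::uint8_t>(255 - p);
        }
    }
}
```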

The thing is that on x86 I get speedups ranging from 3x to 5x on a 4-core processor, but on Android, enabling OpenMP (with -fopenmp) does not give any speedup on 32-bit ARM and even slows the code down on a 64-bit ARMv8 Snapdragon 810.

Did I miss something? Has anybody ever observed speedups on Android + ARM comparable to those on x86 CPUs?

There are lots of tutorials on the internet on how to enable OpenMP, but no benchmarks showing speedups. Any pointers?

The only relevant piece of information I found is a benchmark of the OpenMP overhead on ARMv8, where they also noticed some pretty high overhead: https://wiki.linaro.org/WorkingGroups/Middleware/Graphics/GPGPU/Docs/OpenMPforARMv8PortAnalysis

Thanks, Matthieu

Matthieu G, asked Jun 23 '16 07:06


2 Answers

The problem with multithreading on Android is most likely related to the architecture of many of its CPUs. The Snapdragon 810 is a heterogeneous (low/high) design with 4 strong cores and 4 weak cores.

Specifically, the 810 employs four Cortex-A57 and four Cortex-A53 cores in a big.LITTLE heterogeneous configuration, where all eight cores are available to the OS scheduler.

Without a good worker pool implementation, all the additional threads spawned to balance the workload can end up on the low-performing cores, which by my estimate are approximately three times slower than the strong cores on heavy SIMD calculations (measured on a Samsung Exynos 9611).

The mitigation is either to use thread affinity so that the additional workers are created only on the strong cores, or to tailor each workload to the capabilities of each core; here the work of 16 chunks is split across 8 cores as 3+3+3+3+1+1+1+1 (with the fast cores having CPU ids 4..7).

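#include <omp.h>
#include <sched.h>   // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET

#pragma omp parallel num_threads(8)
{
   int tid = omp_get_thread_num();

   // Pin threads 0..3 to the fast cores 7..4 and threads 4..7 to the slow cores 3..0.
   cpu_set_t aff;
   CPU_ZERO(&aff);
   CPU_SET(7 - tid, &aff);
   sched_setaffinity(0, sizeof(aff), &aff);   // pid 0 = the calling thread

   // Fast cores take 3 chunks each (0..11), slow cores 1 chunk each (12..15).
   if (tid < 4) do_task(tid * 3, tid * 3 + 3);
   else do_task(tid + 8, tid + 9);
}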
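The 3+3+3+3+1+1+1+1 split is hard-coded in the snippet above. A minimal sketch of how such ranges could be computed from per-thread weights follows; split_chunks and the weight values are illustrative assumptions, not part of the original code.

```cpp
#include <utility>
#include <vector>

// Split [0, total_chunks) into one contiguous range per thread, sized in
// proportion to a relative speed weight per thread. The weights are an
// assumption supplied by the caller (e.g. 3 for a fast core, 1 for a slow one).
std::vector<std::pair<int, int>> split_chunks(int total_chunks,
                                              const std::vector<int>& weights) {
    long long weight_sum = 0;
    for (int w : weights) weight_sum += w;

    std::vector<std::pair<int, int>> ranges;
    if (weight_sum <= 0) return ranges;  // nothing sensible to split

    long long prefix = 0;
    for (int w : weights) {
        int begin = static_cast<int>(total_chunks * prefix / weight_sum);
        prefix += w;
        int end = static_cast<int>(total_chunks * prefix / weight_sum);
        ranges.emplace_back(begin, end);  // half-open range [begin, end)
    }
    return ranges;
}
```

With 16 chunks and weights {3, 3, 3, 3, 1, 1, 1, 1}, ranges[tid] reproduces the begin/end arguments passed to do_task for each thread.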

With OpenMP, a task that originally took 110 ms was reduced to 30 ms using this approach, and to about 37 ms when the work was given to just the 4 better cores.
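The variant that uses only the 4 better cores could be sketched roughly as follows. This assumes, as above, that the fast cores have CPU ids 4..7, which differs between devices; pin_to_cpu, process_chunks and run_on_fast_cores are hypothetical names, and process_chunks merely stands in for the real per-chunk work.

```cpp
#include <sched.h>   // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET (Linux/Android)
#include <vector>

// Pin the calling thread to one CPU. Returns false if the kernel rejects the
// request, e.g. because that CPU id does not exist on the device.
bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Hypothetical stand-in for the real work: mark chunks [begin, end) as done.
void process_chunks(std::vector<int>& done, int begin, int end) {
    for (int c = begin; c < end; ++c) done[c] = 1;
}

// Process 16 chunks on 4 threads, one per fast core (assumed CPU ids 4..7).
std::vector<int> run_on_fast_cores() {
    std::vector<int> done(16, 0);
    // schedule(static, 1) hands one iteration to each of the 4 threads.
    #pragma omp parallel for num_threads(4) schedule(static, 1)
    for (int t = 0; t < 4; ++t) {
        pin_to_cpu(4 + t);                       // best effort; failure is ignored
        process_chunks(done, t * 4, t * 4 + 4);  // 4 chunks per fast core
    }
    return done;
}
```

Note that the affinity stays in effect after the loop, including for the calling thread, so code that runs afterwards remains pinned unless the mask is reset.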

On continuous workloads (e.g. real-time signal processing), splitting the work into twice as many chunks as there are cores seems to allow the Linux scheduler to learn the computational requirements and to migrate threads to different cores, but it's not foolproof. (With 8 cores that means 16 chunks, and on average each fast core will execute 3 chunks and each slow core 1 chunk.)
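One way to get this kind of uneven distribution without hard-coding it is OpenMP's dynamic schedule, where idle threads pull the next chunk. This is an interpretation of the paragraph above rather than something it states, and process_chunk and process_all are hypothetical names with a placeholder workload.

```cpp
#include <vector>

// Hypothetical per-chunk workload; stands in for the real processing.
long long process_chunk(int chunk) {
    long long acc = 0;
    for (int i = 0; i < 100000; ++i) acc += static_cast<long long>(chunk + 1) * i;
    return acc;
}

// Process num_chunks chunks; pass 2 * (number of cores) to follow the text.
std::vector<long long> process_all(int num_chunks) {
    std::vector<long long> out(num_chunks, 0);
    // dynamic, 1: each thread fetches the next unprocessed chunk when it is
    // idle, so threads running on faster cores end up processing more chunks.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int c = 0; c < num_chunks; ++c) out[c] = process_chunk(c);
    return out;
}
```

The trade-off is a small scheduling cost per chunk, so very short chunks may not benefit.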

Aki Suihkonen, answered Oct 04 '22 02:10


After a small benchmark (https://gist.github.com/matt-42/30b7caf73c345c28e55b7cfd82f5540c), I could observe a 2x speedup on an 8-core ARMv8. I suppose the conclusion is that getting a speedup on a desktop CPU with OpenMP does not mean you will see a similar speedup on ARM CPUs.
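For anyone who wants to repeat such a measurement on their own device, a minimal timing sketch could look like the one below. It is not the code from the gist; transform, time_ms and speedup are hypothetical names, and the square-root workload is only a placeholder.

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical workload: take the square root of every element,
// serially or with OpenMP depending on the flag.
void transform(std::vector<float>& data, bool parallel) {
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for if(parallel)
    for (long i = 0; i < n; ++i) data[i] = std::sqrt(data[i]);
}

// Wall-clock time of one transform() call, in milliseconds.
double time_ms(std::vector<float>& data, bool parallel) {
    auto start = std::chrono::steady_clock::now();
    transform(data, parallel);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Ratio above 1 means the parallel run was faster than the serial one.
double speedup(const std::vector<float>& input) {
    std::vector<float> warm = input, a = input, b = input;
    time_ms(warm, true);                  // warm-up: thread creation is not timed
    double serial_ms = time_ms(a, false);
    double parallel_ms = time_ms(b, true);
    return serial_ms / parallel_ms;
}
```

A single run is noisy on phones because of frequency scaling and thermal throttling, so repeating the measurement and using a large input is advisable.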

Matthieu G, answered Oct 04 '22 02:10