 

Why is this for loop not faster using OpenMP?

I have extracted this simple member function from a larger 2D program. All it does is loop over three arrays and perform one math operation per element (a 1D convolution). I have been testing whether OpenMP can make this particular function faster:

void Image::convolve_lines()
{
  const int *ptr0 = tmp_bufs[0];
  const int *ptr1 = tmp_bufs[1];
  const int *ptr2 = tmp_bufs[2];
  const int width = Width;
#pragma omp parallel for
  for ( int x = 0; x < width; ++x )
    {
    const int sum = 0
      + 1 * ptr0[x]
      + 2 * ptr1[x]
      + 1 * ptr2[x];
    output[x] = sum;
    }
}

If I use gcc 4.7 on debian/wheezy amd64, the overall program performs a lot slower on an 8-CPU machine. If I use gcc 4.9 on debian/jessie amd64 (only 4 CPUs on this machine), the overall program performs with very little difference.

Using time to compare: single core run:

$ ./test black.pgm out.pgm  94.28s user 6.20s system 84% cpu 1:58.56 total

multi core run:

$ ./test black.pgm out.pgm  400.49s user 6.73s system 344% cpu 1:58.31 total

Where:

$ head -3 black.pgm
P5
65536 65536
255

So Width is set to 65536 during execution.

If that matters, I am using cmake for compilation:

add_executable(test test.cxx)
set_target_properties(test PROPERTIES COMPILE_FLAGS "-fopenmp" LINK_FLAGS "-fopenmp")

And CMAKE_BUILD_TYPE is set to:

CMAKE_BUILD_TYPE:STRING=Release

which implies -O3 -DNDEBUG
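As a side note on the build setup: instead of setting raw `COMPILE_FLAGS`/`LINK_FLAGS`, CMake (3.9 and later) ships a `FindOpenMP` module with an imported target that applies the right compile and link flags portably. A possible sketch:

```
find_package(OpenMP REQUIRED)
add_executable(test test.cxx)
target_link_libraries(test PRIVATE OpenMP::OpenMP_CXX)
```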

My question: why is this for loop not faster using multiple cores? There is no overlap between the arrays, and OpenMP should split the work equally. I do not see where the bottleneck is coming from.

EDIT: as suggested in the comments, I changed my input file to:

$ head -3 black2.pgm
P5
33554432 128
255

So Width is now set to 33554432 during execution (which should be large enough). Now the timing reveals:

single core run:

$ ./test ./black2.pgm out.pgm  100.55s user 5.77s system 83% cpu 2:06.86 total

multi core run (for some reason cpu% was always below 100%, which would suggest no extra threads at all):

$ ./test ./black2.pgm out.pgm  117.94s user 7.94s system 98% cpu 2:07.63 total
asked Nov 01 '22 by malat


1 Answer

I have some general comments:

1. Before optimizing your code, make sure the data is 16-byte aligned. This is extremely important for whatever optimization you want to apply. And if the data is separated into 3 pieces, it is better to add some padding elements so that the starting addresses of the 3 pieces are all 16-byte aligned. By doing so, the CPU can load your data into cache lines easily.

2. Make sure the simple function is vectorized before adding OpenMP. In most cases, using the AVX/SSE instruction sets gives a decent 2 to 8x single-thread improvement. And it is very simple in your case: create a constant __m256i register set to the value 2, and load 8 integers into three __m256i registers. On a Haswell processor, one addition and one multiplication can execute together, so in theory the loop could speed up by a factor of 12 if the AVX pipeline can be kept full!

3. Sometimes parallelization can degrade performance. Modern CPUs need several hundred to a few thousand clock cycles to warm up, entering high-performance states and scaling up frequency. If the task is not large enough, it is very likely finished before the CPU warms up, so you cannot gain a speed boost by going parallel. And don't forget that OpenMP has overhead as well: thread creation, synchronization, and teardown. Another case is poor memory management: if data accesses are too scattered, all CPU cores sit idle waiting for data from RAM.

My Suggestion:

You might want to try Intel MKL; don't reinvent the wheel. The library is optimized to the extreme, with no clock cycle wasted. You can link against either the serial library or the parallel version; if going parallel can help at all, a speed boost is guaranteed.

answered Nov 15 '22 by PhD AP EcE