I have extracted this simple member function from a larger 2D program; all it does is loop over three arrays and do a math operation (a 1D convolution). I have been testing OpenMP to make this particular function faster:
void Image::convolve_lines()
{
  const int *ptr0 = tmp_bufs[0];
  const int *ptr1 = tmp_bufs[1];
  const int *ptr2 = tmp_bufs[2];
  const int width = Width;
#pragma omp parallel for
  for ( int x = 0; x < width; ++x )
  {
    const int sum = 0
      + 1 * ptr0[x]
      + 2 * ptr1[x]
      + 1 * ptr2[x];
    output[x] = sum;
  }
}
With gcc 4.7 on Debian/wheezy amd64, the overall program performs a lot slower on an 8-CPU machine. With gcc 4.9 on Debian/jessie amd64 (only 4 CPUs on that machine), the overall program performs with very little difference.
Using time to compare:
single core run:
$ ./test black.pgm out.pgm 94.28s user 6.20s system 84% cpu 1:58.56 total
multi core run:
$ ./test black.pgm out.pgm 400.49s user 6.73s system 344% cpu 1:58.31 total
Where:
$ head -3 black.pgm
P5
65536 65536
255
So Width is set to 65536 during execution.
In case it matters, I am using CMake for compilation:
add_executable(test test.cxx)
set_target_properties(test PROPERTIES COMPILE_FLAGS "-fopenmp" LINK_FLAGS "-fopenmp")
And CMAKE_BUILD_TYPE is set to:
CMAKE_BUILD_TYPE:STRING=Release
which implies -O3 -DNDEBUG
My question: why is this for loop not faster using multiple cores? There is no overlap between the arrays, and OpenMP should split the iterations equally. I do not see where the bottleneck is coming from.
EDIT: as suggested in the comments, I changed my input file to:
$ head -3 black2.pgm
P5
33554432 128
255
So Width is now set to 33554432 during execution (which should be big enough). Now the timing shows:
single core run:
$ ./test ./black2.pgm out.pgm 100.55s user 5.77s system 83% cpu 2:06.86 total
multi-core run (for some reason the CPU % was always below 100%, which would suggest no extra threads at all):
$ ./test ./black2.pgm out.pgm 117.94s user 7.94s system 98% cpu 2:07.63 total
I have some general comments:
1. Before optimizing your code, make sure the data is 16-byte aligned. This is extremely important for whatever optimization you want to apply. If the data is separated into 3 pieces, it is better to add some dummy elements so that the starting addresses of all 3 pieces are 16-byte aligned. That way the CPU can load your data into cache lines easily (see the allocation sketch after this list).
2. Make sure the simple function is vectorized before adding OpenMP. In most cases, using the AVX/SSE instruction sets should give you a decent 2x to 8x single-thread improvement. It is very simple in your case: create a constant mm256 register set to the value 2, and load 8 integers into three mm256 registers. On a Haswell processor, one addition and one multiplication can be executed together, so in theory the loop should speed up by a factor of 12 if the AVX pipeline can be kept full! (A sketch of such an intrinsics loop follows this list.)
3. Sometimes parallelization can degrade performance: a modern CPU needs several hundred to several thousand clock cycles to warm up, enter high-performance states and scale up its frequency. If the task is not large enough, it is very likely finished before the CPU warms up, so you cannot get a speed boost by going parallel. Don't forget that OpenMP has overhead as well: thread creation, synchronization and destruction. Another case is a poor memory access pattern: when data accesses are scattered, all CPU cores sit idle waiting for data from RAM. (The timing harness after this list is one way to check which it is.)
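To illustrate point 1, here is a minimal sketch of how one of the line buffers could be allocated on a 16-byte boundary with posix_memalign. How tmp_bufs is really allocated is not shown in the question, so the helper below is hypothetical:

#include <stdlib.h>   // posix_memalign, free

// Hypothetical helper: allocate one line buffer whose start address is
// 16-byte aligned (32 bytes would also cover AVX loads).
int *alloc_aligned_line(size_t width)
{
  void *p = NULL;
  if (posix_memalign(&p, 16, width * sizeof(int)) != 0)
    return NULL;                    // allocation failed
  return static_cast<int *>(p);
}
// Each buffer allocated this way must later be released with free().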
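For point 2, here is a hedged sketch of what an intrinsics version of the loop could look like. It assumes AVX2 (Haswell has it; 256-bit integer arithmetic is not in plain AVX), compilation with -mavx2, and the same pointer names as the question; the scalar tail handles widths that are not a multiple of 8:

#include <immintrin.h>

void convolve_lines_avx2(const int *ptr0, const int *ptr1, const int *ptr2,
                         int *output, int width)
{
  const __m256i two = _mm256_set1_epi32(2);     // constant register holding 2
  int x = 0;
  for (; x + 8 <= width; x += 8)
  {
    __m256i a = _mm256_loadu_si256((const __m256i *)(ptr0 + x));
    __m256i b = _mm256_loadu_si256((const __m256i *)(ptr1 + x));
    __m256i c = _mm256_loadu_si256((const __m256i *)(ptr2 + x));
    __m256i sum = _mm256_add_epi32(a, c);                      // ptr0[x] + ptr2[x]
    sum = _mm256_add_epi32(sum, _mm256_mullo_epi32(b, two));   // + 2 * ptr1[x]
    _mm256_storeu_si256((__m256i *)(output + x), sum);
  }
  for (; x < width; ++x)                        // scalar tail
    output[x] = ptr0[x] + 2 * ptr1[x] + ptr2[x];
}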
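For point 3, a quick way to see whether OpenMP overhead or memory bandwidth dominates is to time just this function at different thread counts with omp_get_wtime(). The harness below is hypothetical (it assumes the Image object from the question and a C++11 lambda):

#include <omp.h>

// Average wall-clock seconds for one call to fn(), over `repeats` runs.
template <typename Fn>
double average_seconds(Fn fn, int repeats)
{
  double t0 = omp_get_wtime();
  for (int i = 0; i < repeats; ++i)
    fn();
  return (omp_get_wtime() - t0) / repeats;
}

// Usage idea, with img being the Image instance from the question:
//   omp_set_num_threads(1);
//   double t1 = average_seconds([&] { img.convolve_lines(); }, 100);
//   omp_set_num_threads(8);
//   double t8 = average_seconds([&] { img.convolve_lines(); }, 100);
// If t8 is not clearly smaller than t1, the loop is memory-bound (or too
// small) and more threads cannot help.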
My Suggestion:
You might want to try Intel MKL instead of reinventing the wheel. The library is optimized to the extreme, with no clock cycle wasted. You can link against either the serial library or the parallel version; if a speedup from going parallel is possible, it is guaranteed.