I have edited my question after previous comments (especially @Zboson) for better readability
I have always acted on, and observed, the conventional wisdom that the number of OpenMP threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with an Intel Core i7 4960HQ, 4 cores / 8 threads. (See the Intel docs here)
Here is my test code:
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 256*8192*100;
    double *A, *B;
    posix_memalign((void**)&A, 64, n*sizeof(double));
    posix_memalign((void**)&B, 64, n*sizeof(double));
    for (int i = 0; i < n; ++i) {
        A[i] = 0.1;
        B[i] = 0.0;
    }
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        B[i] = exp(A[i]) + sin(B[i]);
    }
    double end = omp_get_wtime();
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += B[i];
    }
    printf("%g %g\n", end - start, sum);
    return 0;
}
When I compile it using gcc 4.9-20140209 (a 4.9 snapshot), with the command:
gcc -Ofast -march=native -std=c99 -fopenmp -Wa,-q
I see the following performance as I change OMP_NUM_THREADS (each point is the average of 5 runs; the error bars, which are barely visible, are the standard deviations):
The plot is clearer when shown as the speed-up relative to OMP_NUM_THREADS=1:
The performance increases more or less monotonically with the thread count, even when the number of OpenMP threads greatly exceeds the core count and even the hyper-thread count! Usually the performance should drop off when too many threads are used (at least in my previous experience), due to the threading overhead, especially since the calculation should be CPU-bound (or at least memory-bound) and not waiting on I/O.
Even more weirdly, the speed-up is 35 times!
Can anyone explain this?
I also tested this with a much smaller array (8192*4) and see similar performance scaling.
In case it matters, I am on Mac OS 10.9, and the performance data were obtained by running (under bash):
for i in {1..128}; do
    for k in {1..5}; do
        export OMP_NUM_THREADS=$i
        echo -ne $i $k ""
        ./a.out
    done
done > out
EDIT: Out of curiosity I decided to try much larger numbers of threads. My OS limits this to 2000. The odd results (both the speed-up and the low thread overhead) speak for themselves!
EDIT: I tried @Zboson's latest suggestion in their answer, i.e. putting VZEROUPPER before each math function within the loop, and it did fix the scaling problem! (It also brought the single-threaded time down from 22 s to 2 s!):
The problem is likely due to the clock() function. It does not return the wall time on Linux; it returns the CPU time accumulated by the process across all threads. You should use the function omp_get_wtime() instead. It's more accurate than clock() and works on GCC, ICC, and MSVC. In fact I use it for timing code even when I'm not using OpenMP.
I tested your code with it here http://coliru.stacked-crooked.com/a/26f4e8c9fdae5cc2
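To make the difference concrete, here is a minimal sketch (mine, not from the original post) that times the same kind of parallel loop with both clock() and omp_get_wtime(); the array size and loop body are arbitrary placeholders:

#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int n = 1 << 24;                  /* arbitrary, much smaller than the question's array */
    double *B = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i)
        B[i] = 0.1 * i;

    clock_t c0 = clock();                   /* CPU time, summed over all of the process's threads */
    double  w0 = omp_get_wtime();           /* wall-clock time */

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        B[i] = exp(B[i] * 1e-9) + sin(B[i]);

    double  w1 = omp_get_wtime();
    clock_t c1 = clock();

    printf("clock(): %.3f s   omp_get_wtime(): %.3f s   (checksum %g)\n",
           (double)(c1 - c0) / CLOCKS_PER_SEC, w1 - w0, B[n/2]);
    free(B);
    return 0;
}

As you add threads, the clock() figure stays roughly constant or even grows, because it accumulates CPU time across all threads, while the omp_get_wtime() figure is the one that actually shrinks.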
Edit: Another thing to consider, which may be causing your problem, is that the exp and sin functions you are using are compiled WITHOUT AVX support, while your code is compiled with AVX support (actually AVX2). You can see this in GCC Explorer with your code if you compile with -fopenmp -mavx2 -mfma.
Whenever you call a function without AVX support from AVX code, you need to zero the upper part of the YMM registers or pay a large penalty. You can do this with the intrinsic _mm256_zeroupper (VZEROUPPER). Clang does this for you, but last I checked GCC does not, so you have to do it yourself (see the comments to this question Math functions takes more cycles after running any intel AVX function and also the answer here Using AVX CPU instructions: Poor performance without "/arch:AVX"). So every iteration you have a large delay due to not calling VZEROUPPER. I'm not sure why this matters more with multiple threads, but if GCC does this each time it starts a new thread, then it could help explain what you are seeing.
#include <immintrin.h>

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    _mm256_zeroupper();      /* zero the upper halves of the YMM registers   */
    B[i] = sin(B[i]);        /* before each call into the non-AVX libm code  */
    _mm256_zeroupper();
    B[i] += exp(A[i]);
}
Edit: A simpler way to test this is, instead of compiling with -march=native, to not set the arch at all (gcc -Ofast -std=c99 -fopenmp -Wa,-q) or to use only SSE2 (gcc -Ofast -msse2 -std=c99 -fopenmp -Wa,-q).
Edit: GCC 4.8 has an option, -mvzeroupper, which may be the most convenient solution.
This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.
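For example (I have not tested this exact invocation, and test.c is just a placeholder for the asker's source file), the compile line would become something like:

gcc -Ofast -march=native -mvzeroupper -std=c99 -fopenmp -Wa,-q test.c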