A few times I have parallelized portions of programs with OpenMP, only to notice in the end that, despite the good scalability, most of the foreseen speed-up was lost to the poor performance of the single-threaded case (compared to the serial version).

The usual explanation offered on the web for this behavior is that the code generated by compilers may be worse in the multi-threaded case. However, I cannot find a reference anywhere that explains why the assembly may be worse.

So, what I would like to ask the compiler folks out there is:

Can compiler optimizations be inhibited by multi-threading? If so, how could performance be affected?

If it helps narrow down the question, I am mainly interested in high-performance computing.
Disclaimer: As stated in the comments, some of the answers below may become obsolete over time, as they discuss the way compilers handled these optimizations at the time the question was posed.
I think this answer describes the reason sufficiently, but I'll expand a bit here.
Before that, however, here's gcc 4.8's documentation on -fopenmp:
Enable handling of OpenMP directives #pragma omp in C/C++ and !$omp in Fortran. When -fopenmp is specified, the compiler generates parallel code according to the OpenMP Application Program Interface v3.0 http://www.openmp.org/. This option implies -pthread, and thus is only supported on targets that have support for -pthread.
Note that it doesn't specify disabling of any features. Indeed, there is no reason for gcc to disable any optimization.
The reason, however, that OpenMP with one thread has overhead compared to no OpenMP at all is that the compiler needs to transform the code, outlining the loop body into a separate function so that it is ready for the case of n > 1 threads. So let's think of a simple example:
```c
int *b = ...;
int *c = ...;
int a = 0;

#pragma omp parallel for reduction(+:a)
for (i = 0; i < 100; ++i)
    a += b[i] + c[i];
```
This code should be converted to something like this:
```c
struct __omp_func1_data {
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];
    return NULL;
}

...
for (t = 1; t < nthreads; ++t)
    /* create_thread with __omp_func1 function */
/* for master thread, don't create a thread */
struct master_data md = {
    .start = /*...*/,
    .end = /*...*/,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;

for (t = 1; t < nthreads; ++t)
{
    /* join with thread */
    /* add thread_data->a to a */
}
```
Now if we run this with nthreads == 1, the code effectively gets reduced to:
```c
struct __omp_func1_data {
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];
    return NULL;
}

...
struct master_data md = {
    .start = 0,
    .end = 100,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;
```
So what are the differences between the no-OpenMP version and the single-threaded OpenMP version?

One difference is the extra glue code. The variables that need to be passed to the function created by OpenMP have to be packed together to form a single argument, so there is some overhead in preparing for the function call (and later in retrieving the data).

More important, however, is that the code is no longer in one piece. Cross-function optimization is not so advanced yet, and most optimizations are done within each function. Smaller functions mean less opportunity to optimize.
To finish this answer, I'd like to show you exactly how -fopenmp affects gcc's options. (Note: I'm on an old computer now, so I have gcc 4.4.3.)
Running gcc -Q -v some_file.c gives this (relevant) output:
```
GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486 -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident -finline-functions-called-once -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fmove-loop-invariants -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double -maccumulate-outgoing-args -malign-stringops -mfancy-math-387 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4 -mpush-args -msahf -mtls-direct-seg-refs
```
and running gcc -Q -v -fopenmp some_file.c gives this (relevant) output:
```
GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v -D_REENTRANT a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486 -fopenmp -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident -finline-functions-called-once -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fmove-loop-invariants -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double -maccumulate-outgoing-args -malign-stringops -mfancy-math-387 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4 -mpush-args -msahf -mtls-direct-seg-refs
```
Taking a diff, we can see that the only difference is that with -fopenmp, we have -D_REENTRANT defined (and of course -fopenmp enabled). So, rest assured, gcc wouldn't produce worse code. It's just that it needs to add preparation code for the case when the number of threads is greater than 1, and that has some overhead.
Update: I really should have tested this with optimization enabled. Anyway, with gcc 4.7.3, the output of the same commands with -O3 added gives the same difference. So, even with -O3, no optimizations are disabled.