 

May compiler optimizations be inhibited by multi-threading?

A few times I have parallelized portions of programs with OpenMP, only to notice that in the end, despite the good scalability, most of the foreseen speed-up was lost to the poor performance of the single-threaded case (compared to the serial version).

The usual explanation found on the web for this behavior is that the code generated by compilers may be worse in the multi-threaded case. However, I have not been able to find a reference anywhere that explains why the assembly may be worse.

So, what I would like to ask the compiler experts out there is:

May compiler optimizations be inhibited by multi-threading? If so, how could performance be affected?

If it helps narrow down the question, I am mainly interested in high-performance computing.

Disclaimer: As stated in the comments, parts of the answer below may become obsolete in the future, as they briefly discuss the way compilers handled optimizations at the time the question was posed.

Massimiliano asked May 29 '13 07:05




1 Answer

I think this answer describes the reason sufficiently, but I'll expand a bit here.

First, though, here is gcc 4.8's documentation on -fopenmp:

```
-fopenmp
    Enable handling of OpenMP directives #pragma omp in C/C++ and !$omp in
    Fortran. When -fopenmp is specified, the compiler generates parallel
    code according to the OpenMP Application Program Interface v3.0
    http://www.openmp.org/. This option implies -pthread, and thus is only
    supported on targets that have support for -pthread.
```

Note that it doesn't specify disabling of any features. Indeed, there is no reason for gcc to disable any optimization.

The reason why OpenMP with one thread has overhead compared to no OpenMP is that the compiler needs to transform the code, outlining loop bodies into separate functions, so that it is ready for the case of n > 1 threads. So let's think of a simple example:

```c
int *b = ...
int *c = ...
int a = 0;

#pragma omp parallel for reduction(+:a)
for (i = 0; i < 100; ++i)
    a += b[i] + c[i];
```

This code should be converted to something like this:

```c
struct __omp_func1_data {
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];

    return NULL;
}

...
for (t = 1; t < nthreads; ++t)
    /* create_thread with __omp_func1 function */
/* for master thread, don't create a thread */
struct master_data md = {
    .start = /*...*/,
    .end = /*...*/,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;
for (t = 1; t < nthreads; ++t) {
    /* join with thread */
    /* add thread_data->a to a */
}
```

Now if we run this with nthreads == 1, the code effectively gets reduced to:

```c
struct __omp_func1_data {
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];

    return NULL;
}

...
struct master_data md = {
    .start = 0,
    .end = 100,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;
```

So what are the differences between the no-OpenMP version and the single-threaded OpenMP version?

One difference is the extra glue code. The variables that need to be passed to the function created by OpenMP must be packed together to form one argument, so there is some overhead in preparing for the function call (and later retrieving the data).

More importantly, however, the code is no longer in one piece. Cross-function optimization is not very advanced, and most optimizations are done within each function. Smaller functions mean less opportunity to optimize.


To finish this answer, I'd like to show you exactly how -fopenmp affects gcc's options. (Note: I'm on an old computer now, so I have gcc 4.4.3.)

Running gcc -Q -v some_file.c gives this (relevant) output:

```
GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486
 -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec
 -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining
 -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident
 -finline-functions-called-once -fira-share-save-slots
 -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore
 -fmath-errno -fmerge-debug-strings -fmove-loop-invariants
 -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec
 -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller
 -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im
 -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion
 -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model
 -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
 -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4
 -mpush-args -msahf -mtls-direct-seg-refs
```

and running gcc -Q -v -fopenmp some_file.c gives this (relevant) output:

```
GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v -D_REENTRANT a.c -D_FORTIFY_SOURCE=2 -mtune=generic
 -march=i486 -fopenmp -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec
 -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining
 -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident
 -finline-functions-called-once -fira-share-save-slots
 -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore
 -fmath-errno -fmerge-debug-strings -fmove-loop-invariants
 -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec
 -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller
 -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im
 -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion
 -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model
 -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
 -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4
 -mpush-args -msahf -mtls-direct-seg-refs
```

Taking a diff, we can see that the only difference is that with -fopenmp, -D_REENTRANT is defined (and of course -fopenmp is enabled). So, rest assured, gcc wouldn't produce worse code. It just needs to add preparation code for the case where the number of threads is greater than 1, and that has some overhead.


Update: I really should have tested this with optimization enabled. With gcc 4.7.3, the output of the same commands with -O3 added shows the same single difference. So, even with -O3, no optimizations are disabled.

Shahbaz answered Oct 19 '22 23:10