A few times I have parallelized portions of programs with OpenMP, only to notice in the end that, despite the good scalability, most of the foreseen speed-up was lost to the poor performance of the single-threaded case (compared to the serial version).

The usual explanation offered on the web for this behavior is that the code generated by compilers may be worse in the multi-threaded case. However, I cannot find a reference anywhere that explains why the assembly may be worse.

So, what I would like to ask the compiler folks out there is:

Can compiler optimizations be inhibited by multi-threading? If so, how could performance be affected?

If it helps narrow down the question, I am mainly interested in high-performance computing.
Disclaimer: As stated in the comments, some of the answers below may become obsolete over time, as they discuss the way compilers handled these optimizations at the time the question was posed.
I think this answer describes the reason sufficiently, but I'll expand a bit here.
Before that, however, here's gcc 4.8's documentation on -fopenmp:
Enable handling of OpenMP directives #pragma omp in C/C++ and !$omp in Fortran. When -fopenmp is specified, the compiler generates parallel code according to the OpenMP Application Program Interface v3.0 http://www.openmp.org/. This option implies -pthread, and thus is only supported on targets that have support for -pthread.
Note that it doesn't specify disabling of any features. Indeed, there is no reason for gcc to disable any optimization.
The reason, however, that OpenMP with one thread has overhead compared to no OpenMP at all is that the compiler needs to transform the code, outlining the loop body into a separate function so that it is ready for the case of n > 1 threads. So let's think of a simple example:
```c
int *b = ...;
int *c = ...;
int a = 0;

#pragma omp parallel for reduction(+:a)
for (i = 0; i < 100; ++i)
    a += b[i] + c[i];
```
This code should be converted to something like this:
```c
struct __omp_func1_data {
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];
    return NULL;
}

...
for (t = 1; t < nthreads; ++t)
    /* create_thread with __omp_func1 function */
/* for master thread, don't create a thread */
struct master_data md = {
    .start = /*...*/,
    .end = /*...*/,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;

for (t = 1; t < nthreads; ++t)
{
    /* join with thread */
    /* add thread_data->a to a */
}
```
Now if we run this with nthreads == 1, the code effectively gets reduced to:
```c
struct __omp_func1_data {
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];
    return NULL;
}

...
struct master_data md = {
    .start = 0,
    .end = 100,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;
```
So what are the differences between the no-OpenMP version and the single-threaded OpenMP version?

One difference is the extra glue code. The variables that need to be passed to the function created by OpenMP have to be packed together to form a single argument, so there is some overhead in preparing for the function call (and later in retrieving the data).

More important, however, is that the code is no longer in one piece. Cross-function optimization is not so advanced yet, and most optimizations are done within each function. Smaller functions mean less opportunity to optimize.
To finish this answer, I'd like to show you exactly how -fopenmp affects gcc's options. (Note: I'm on an old computer now, so I have gcc 4.4.3.)
Running gcc -Q -v some_file.c gives this (relevant) output:
```
GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486 -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident -finline-functions-called-once -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fmove-loop-invariants -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double -maccumulate-outgoing-args -malign-stringops -mfancy-math-387 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4 -mpush-args -msahf -mtls-direct-seg-refs
```
and running gcc -Q -v -fopenmp some_file.c gives this (relevant) output:
```
GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v -D_REENTRANT a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486 -fopenmp -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident -finline-functions-called-once -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fmove-loop-invariants -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double -maccumulate-outgoing-args -malign-stringops -mfancy-math-387 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4 -mpush-args -msahf -mtls-direct-seg-refs
```
Taking a diff, we can see that the only difference is that with -fopenmp, we have -D_REENTRANT defined (and of course -fopenmp enabled). So, rest assured, gcc wouldn't produce worse code. It's just that it needs to add preparation code for the case when the number of threads is greater than 1, and that has some overhead.
Update: I really should have tested this with optimization enabled. Anyway, with gcc 4.7.3, the output of the same commands with -O3 added gives the same difference. So, even with -O3, no optimizations are disabled.