I am testing a simple program which calculates the Mandelbrot fractal. I have been checking its performance depending on the number of iterations in the function that checks whether a point belongs to the Mandelbrot set.
The surprising thing is that I am getting a big difference in times after adding the -fPIC flag. From what I have read, the overhead is usually negligible, and the highest overhead I came across was about 6%. I measured around 30% overhead. Any advice will be appreciated!
I use the -O3 flag, gcc 4.7.2, Ubuntu 12.04.2, x86_64.
The results (times in seconds) look as follows:
#iter    C (fPIC)    C        C/C(fPIC)
1        0.01        0.01     1.00
100      0.04        0.03     0.75
200      0.06        0.04     0.67
500      0.15        0.1      0.67
1000     0.28        0.19     0.68
2000     0.56        0.37     0.66
4000     1.11        0.72     0.65
8000     2.21        1.47     0.67
16000    4.42        2.88     0.65
32000    8.8         5.77     0.66
64000    17.6        11.53    0.66
Commands I use:
gcc -O3 -fPIC fractalMain.c fractal.c -o ffpic
gcc -O3 fractalMain.c fractal.c -o f
fractalMain.c:
#include <time.h>
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"

int main()
{
    int iterNumber[] = {1, 100, 200, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000};
    int it;
    for(it = 0; it < 11; ++it)
    {
        clock_t start = clock();
        fractal(iterNumber[it]);
        clock_t end = clock();
        double millis = (end - start)*1000 / CLOCKS_PER_SEC/(double)1000;
        printf("Iter: %d, time: %lf \n", iterNumber[it], millis);
    }
    return 0;
}
fractal.h:
#ifndef FRACTAL_H
#define FRACTAL_H

void fractal(int iter);

#endif
fractal.c:
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"

void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
    *res_re = a_re*b_re - a_im*b_im;
    *res_im = a_re*b_im + a_im*b_re;
}

void sqComplex(double a_re, double a_im, double* res_re, double* res_im)
{
    multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
}

bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
    double zPrev_re = P_re;
    double zPrev_im = P_im;
    double zNext_re = 0;
    double zNext_im = 0;
    double* p_zNext_re = &zNext_re;
    double* p_zNext_im = &zNext_im;
    int i;
    for(i = 1; i <= iter; ++i)
    {
        sqComplex(zPrev_re, zPrev_im, p_zNext_re, p_zNext_im);
        zNext_re = zNext_re + C_re;
        zNext_im = zNext_im + C_im;
        if(zNext_re*zNext_re+zNext_im*zNext_im > 4)
        {
            return false;
        }
        zPrev_re = zNext_re;
        zPrev_im = zNext_im;
    }
    return true;
}

bool isMandelbrot(double P_re, double P_im, int iter)
{
    return isInSet(0, 0, P_re, P_im, iter);
}

void fractal(int iter)
{
    int noIterations = iter;
    double xMin = -1.8;
    double xMax = 1.6;
    double yMin = -1.3;
    double yMax = 0.8;
    int xDim = 512;
    int yDim = 384;
    double P_re, P_im;
    int nop;
    int x, y;
    for(x = 0; x < xDim; ++x)
        for(y = 0; y < yDim; ++y)
        {
            P_re = (double)x*(xMax-xMin)/(double)xDim+xMin;
            P_im = (double)y*(yMax-yMin)/(double)yDim+yMin;
            if(isMandelbrot(P_re, P_im, noIterations))
                nop = x+y;
        }
    printf("%d", nop);
}
It might look a bit artificial to add the -fPIC flag when building an executable (as per one of the comments). So a few words of explanation: at first I only compiled the program as an executable and wanted to compare it to my Lua code, which calls the isMandelbrot function from C. So I created a shared object to call it from Lua, and saw big time differences, but could not understand why they grew with the number of iterations. In the end I found out that it was because of -fPIC. When I create a little C program which calls my Lua script (so effectively I do the same thing, only without the .so), the times are very similar to C (without -fPIC). I have checked this in a few configurations over the last few days, and it consistently shows two sets of very similar results: faster without -fPIC and slower with it.
It turns out that when you compile without the -fPIC option, multiplyComplex, sqComplex, isInSet and isMandelbrot are inlined automatically by the compiler. If you define those functions as static, you will likely get the same performance when compiling with -fPIC, because the compiler will be free to perform inlining.
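The static suggestion above could look roughly like the sketch below. It reuses the helper names from the question's fractal.c and slightly condenses isInSet. Keeping isMandelbrot non-static is an assumption made here because the question calls it from Lua through the shared object; within a single executable it could be static as well.

```c
#include <stdbool.h>

/* static gives the helpers internal linkage: they cannot be interposed,
   so the compiler stays free to inline them even under -fPIC. */
static void multiplyComplex(double a_re, double a_im, double b_re, double b_im,
                            double *res_re, double *res_im)
{
    *res_re = a_re * b_re - a_im * b_im;
    *res_im = a_re * b_im + a_im * b_re;
}

static void sqComplex(double a_re, double a_im, double *res_re, double *res_im)
{
    multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
}

static bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
    double z_re = P_re, z_im = P_im;
    int i;
    for (i = 1; i <= iter; ++i) {
        double next_re, next_im;
        sqComplex(z_re, z_im, &next_re, &next_im);
        next_re += C_re;
        next_im += C_im;
        if (next_re * next_re + next_im * next_im > 4)
            return false;   /* the orbit escaped */
        z_re = next_re;
        z_im = next_im;
    }
    return true;            /* still bounded after iter steps */
}

/* Assumption: this stays externally visible so it can be called from
   outside the shared object. */
bool isMandelbrot(double P_re, double P_im, int iter)
{
    return isInSet(0, 0, P_re, P_im, iter);
}
```

Whether this fully closes the gap is best checked by rebuilding with -O3 -fPIC and repeating the timings, or by inspecting the generated assembly for remaining calls through the PLT.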
The reason why the compiler is unable to automatically inline the helper functions has to do with symbol interposition. Position-independent code is required to access all global data indirectly, i.e. through the global offset table. The very same constraint applies to function calls, which have to go through the procedure linkage table. Since a symbol might get interposed by another one at runtime (see LD_PRELOAD), the compiler cannot simply assume that it is safe to inline a function with global visibility.
That assumption is safe, however, if you compile without -fPIC: the compiler can assume that a global symbol defined in the executable cannot be interposed, because the lookup scope begins with the executable itself, followed by all other libraries, including the preloaded ones.
For a more thorough understanding have a look at the following paper.
As other people already pointed out, -fPIC forces GCC to disable many optimizations, e.g. inlining and cloning. I'd like to point out several ways to overcome this:
- replace -fPIC with -fPIE if you are compiling the main executable (not libraries), as this allows the compiler to assume that interposition is not possible;
- use -fvisibility=hidden and __attribute__((visibility("default"))) to export only the necessary functions from the library and hide the rest; this allows GCC to optimize hidden functions more aggressively;
- use symbol aliases (__attribute__((alias ("__f")));) to refer to library functions from within the library; this would again untie GCC's hands;
- use the -fno-semantic-interposition flag that was added in recent GCC versions.
It's interesting to note that Clang is different from GCC, as it allows all optimizations by default regardless of -fPIC (this can be overridden with -fsemantic-interposition to obtain GCC-like behavior).
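As a minimal sketch of the visibility approach from the list above, assuming a GCC- or Clang-compatible compiler: the macro names API_EXPORT and API_LOCAL and the functions squared_norm and escapes are hypothetical and do not come from the question's code.

```c
/* Hypothetical convenience macros around the GCC/Clang visibility attribute. */
#define API_EXPORT __attribute__((visibility("default")))
#define API_LOCAL  __attribute__((visibility("hidden")))

/* Hidden: not exported from the shared object, so it cannot be interposed
   and the compiler may inline it into callers even with -fPIC. */
API_LOCAL double squared_norm(double re, double im)
{
    return re * re + im * im;
}

/* Exported: the only function in this sketch that other modules can bind to.
   Returns 1 if the point lies outside the radius-2 circle, 0 otherwise. */
API_EXPORT int escapes(double re, double im)
{
    return squared_norm(re, im) > 4.0;
}
```

Building the library with -fvisibility=hidden would make hidden the default, so only the functions marked API_EXPORT would need an annotation.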
As others have discussed in the comment section of your opening post, compiling with -flto should help to reduce the difference in run-times you are seeing for this particular case, since the link time optimisations of gcc will likely figure out that it's actually ok to inline a couple of functions ;)
In general, link time optimisations could lead to massive reductions in code size (~6%; see the paper on link time optimisations in gold), and thus in run time as well (more of your program fits in the cache). Also note that -fPIC is mostly viewed as a feature that enables tighter security and is always enabled in Android. This question on SO briefly discusses it as well. Also, just to let you know, -fpic can generate smaller and faster code than -fPIC on targets where it limits the size of the global offset table, so if you must use position-independent code, try -fpic instead (see the gcc docs). For x86 it might not make a difference, but you need to check this for yourself or ask on gcc-help.