The other day I ran into a weird problem using GCC and the '-Ofast' optimization flag. Compiling the below program using 'gcc -Ofast -o fib1 fib1.c'.
#include <stdio.h>
int f1(int n) {
if (n < 2) {
return n;
}
int a, b;
a = f1(n - 1);
b = f1(n - 2);
return a + b;
}
int main(){
printf("%d", f1(40));
}
When measuring execution time, the result is:
peter@host ~ $ time ./fib1
102334155
real 0m0.511s
user 0m0.510s
sys 0m0.000s
Now let's introduce a global variable in our program and compile again using 'gcc -Ofast -o fib2 fib2.c'.
#include <stdio.h>
int global;
int f1(int n) {
if (n < 2) {
return n;
}
int a, b;
a = f1(n - 1);
b = f1(n - 2);
global = 0;
return a + b;
}
int main(){
printf("%d", f1(40));
}
Now the execution time is:
peter@host ~ $ time ./fib2
102334155
real 0m0.265s
user 0m0.265s
sys 0m0.000s
The new global variable does not do anything meaningful. However, the difference in execution time is considerable.
Apart from the question (1) what the reason is for such behavior, it also would be nice if (2) the last performance could be achieved without introducing meaningless variables. Any suggestions?
Thanks Peter
I believe you hit some very clever and very weird gcc (mis-?)optimization. That's about as far as I got in researching this.
I modified your code to have an #ifdef G
around the global:
$ cc -O3 -o foo foo.c && time ./foo
102334155
real 0m0.634s
user 0m0.631s
sys 0m0.001s
$ cc -O3 -DG -o foo foo.c && time ./foo
102334155
real 0m0.365s
user 0m0.362s
sys 0m0.001s
So I have the same weird performance difference.
When in doubt, read the generated assembler.
$ cc -S -O3 -o foo.s -S foo.c
$ cc -S -DG -O3 -o foog.s -S foo.c
Here it gets truly bizarre. Normally I can follow gcc-generated code pretty easily. The code that got generated here is just incomprehensible. What should be pretty straightforward recursion and addition that should fit in 15-20 instructions, gcc expanded to a several hundred instructions with a flurry of shifts, additions, subtractions, compares, branches and a large array on the stack. It looks like it tried to partially convert one or both recursions into an iteration and then unrolled that loop. One thing struck me though, the non-global function had only one recursive call to itself (the second one is the call from main):
$ grep 'call.*f1' foo.s | wc
2 4 18
While the global one one had:
$ grep 'call.*f1' foog.s | wc
33 66 297
My educated (I've seen this many times before) guess? Gcc tried to be clever and in its fervor the function that in theory should be easier to optimize generated worse code while the write to the global variable made it sufficiently confused that it couldn't optimize so hard that it led to better code. This happens all the time, many optimizations that gcc (and other compilers too, let's not single them out) uses are very specific to certain benchmarks they use and might not generate faster running code in many other cases. In fact, from experience I only ever use -O2 unless I've benchmarked things very carefully to see that -O3 makes sense. It very often doesn't.
If you really want to research this further, I'd recommend reading gcc documentation about which optimizations get enabled with -O3
as opposed to -O2
(-O2
doesn't do this), then try them one by one until you find which one causes this behavior and that optimization should be a pretty good hint for what's going on. I was about to do this, but I ran out of time (must do last minute christmas shopping).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With