The following code compiled with clang
runs almost 60 times faster than the one compiled withgcc
with identical compiler flags (either -O2
or -O3
):
#include <iostream> #include <math.h> #include <chrono> #include <limits> long double func(int num) { long double i=0; long double k=0.7; for(int t=1; t<num; t++){ for(int n=1; n<16; n++){ i += pow(k,n); } } return i; } int main() { volatile auto num = 3000000; // avoid constant folding std::chrono::time_point<std::chrono::system_clock> start, end; start = std::chrono::system_clock::now(); auto i = func(num); end = std::chrono::system_clock::now(); std::chrono::duration<double> elapsed = end-start; std::cout.precision(std::numeric_limits<long double>::max_digits10); std::cout << "Result " << i << std::endl; std::cout << "Elapsed time is " << elapsed.count() << std::endl; return 0; }
I have tested this with three gcc
versions 4.8.4/4.9.2/5.2.1
and two clang
versions 3.5.1/3.6.1
and here are the timings on my machine (for gcc 5.2.1
and clang 3.6.1
):
Timing -O3
:
gcc: 2.41888s clang: 0.0396217s
Timing -O2
:
gcc: 2.41024s clang: 0.0395114s
Timing -O1
:
gcc: 2.41766s clang: 2.43113s
So it seems that gcc
does not optimize this function at all even at higher optimization levels. The assembly output of clang
is almost around 100 lines longer than gcc
and I don't think it is necessary to post it here, all I can say is that in gcc
assembly output there is a call to pow
which does not appear in clang
assembly, presumably because clang
optimizes it to a bunch of intrinsic calls.
Since the results are identical (i.e. i = 6966764.74717416727754
), the question is:
gcc
not optimize this function when clang
can?k
to 1.0
and gcc
becomes as fast, is there a floating point arithmetic issue that gcc
cannot by-pass? I did try static_cast
ing and turned on the warnings to see if there was any issue with implicit conversions, but not really.
Update: For completeness here are the results for -Ofast
gcc: 0.00262204s clang: 0.0013267s
The point is that gcc
does not optimize the code at O2/O3
.
From this godbolt session clang is able to perform all the pow
calculations at compile time. It knows at compile time what the values of k
and n
are and it just constant folds the calculation:
.LCPI0_0: .quad 4604480259023595110 # double 0.69999999999999996 .LCPI0_1: .quad 4602498675187552091 # double 0.48999999999999994 .LCPI0_2: .quad 4599850558606658239 # double 0.34299999999999992 .LCPI0_3: .quad 4597818534454788671 # double 0.24009999999999995 .LCPI0_4: .quad 4595223380205512696 # double 0.16806999999999994 .LCPI0_5: .quad 4593141924544133109 # double 0.11764899999999996 .LCPI0_6: .quad 4590598673379842654 # double 0.082354299999999963 .LCPI0_7: .quad 4588468774839143248 # double 0.057648009999999972 .LCPI0_8: .quad 4585976388698138603 # double 0.040353606999999979 .LCPI0_9: .quad 4583799016135705775 # double 0.028247524899999984 .LCPI0_10: .quad 4581356477717521223 # double 0.019773267429999988 .LCPI0_11: .quad 4579132580613789641 # double 0.01384128720099999 .LCPI0_12: .quad 4576738892963968780 # double 0.0096889010406999918 .LCPI0_13: .quad 4574469401809764420 # double 0.0067822307284899942 .LCPI0_14: .quad 4572123587912939977 # double 0.0047475615099429958
and it unrolls the inner loop:
.LBB0_2: # %.preheader faddl .LCPI0_0(%rip) faddl .LCPI0_1(%rip) faddl .LCPI0_2(%rip) faddl .LCPI0_3(%rip) faddl .LCPI0_4(%rip) faddl .LCPI0_5(%rip) faddl .LCPI0_6(%rip) faddl .LCPI0_7(%rip) faddl .LCPI0_8(%rip) faddl .LCPI0_9(%rip) faddl .LCPI0_10(%rip) faddl .LCPI0_11(%rip) faddl .LCPI0_12(%rip) faddl .LCPI0_13(%rip) faddl .LCPI0_14(%rip)
Note, that it is using a builtin function(gcc documents theirs here) to calculate pow
at compile time and if we use -fno-builtin it no longer performs this optimization.
If you change k
to 1.0
then gcc is able to perform the same optimization:
.L3: fadd %st, %st(1) #, addl $1, %eax #, t cmpl %eax, %edi # t, num fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, fadd %st, %st(1) #, jne .L3 #,
Although it is a simpler case.
If you change the condition for the inner loop to n < 4
then gcc seems willing to optimize when k = 0.7
. As indicated in the comments to the question, if the compiler does not believe unrolling will help then it will likely be conservative in how much unrolling it will do since there is a code size trade off.
As indicated in the comments I am using a modified version of the OP's code in the godbolt examples but it does not change the underlying conclusion.
Note as indicated in a comment above if we use -fno-math-errno, which stops errno
from being set, gcc does apply a similar optimization.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With