Why does clang produce a much faster code than gcc for this simple function involving exponentiation?

Question

The following code compiled with clang runs almost 60 times faster than the one compiled withgcc with identical compiler flags (either -O2 or -O3):

#include <iostream> #include <math.h>  #include <chrono> #include <limits>  long double func(int num) {     long double i=0;     long double k=0.7;      for(int t=1; t<num; t++){       for(int n=1; n<16; n++){         i += pow(k,n);       }     }     return i; }   int main() {    volatile auto num = 3000000; // avoid constant folding     std::chrono::time_point<std::chrono::system_clock> start, end;    start = std::chrono::system_clock::now();     auto i = func(num);     end = std::chrono::system_clock::now();    std::chrono::duration<double> elapsed = end-start;    std::cout.precision(std::numeric_limits<long double>::max_digits10);    std::cout << "Result " << i << std::endl;    std::cout << "Elapsed time is " << elapsed.count() << std::endl;     return 0; }

I have tested this with three gcc versions 4.8.4/4.9.2/5.2.1 and two clang versions 3.5.1/3.6.1 and here are the timings on my machine (for gcc 5.2.1 and clang 3.6.1):

Timing -O3:

gcc:    2.41888s clang:  0.0396217s

Timing -O2:

gcc:    2.41024s clang:  0.0395114s

Timing -O1:

gcc:    2.41766s clang:  2.43113s

So it seems that gcc does not optimize this function at all even at higher optimization levels. The assembly output of clang is almost around 100 lines longer than gcc and I don't think it is necessary to post it here, all I can say is that in gcc assembly output there is a call to pow which does not appear in clang assembly, presumably because clang optimizes it to a bunch of intrinsic calls.

Since the results are identical (i.e. i = 6966764.74717416727754), the question is:

Why can gcc not optimize this function when clang can?
Change the value of k to 1.0 and gcc becomes as fast, is there a floating point arithmetic issue that gcc cannot by-pass?

I did try static_casting and turned on the warnings to see if there was any issue with implicit conversions, but not really.

Update: For completeness here are the results for -Ofast

gcc:    0.00262204s clang:  0.0013267s

The point is that gcc does not optimize the code at O2/O3.

Shafik Yaghmour · Accepted Answer

From this godbolt session clang is able to perform all the pow calculations at compile time. It knows at compile time what the values of k and n are and it just constant folds the calculation:

.LCPI0_0:     .quad   4604480259023595110     # double 0.69999999999999996 .LCPI0_1:     .quad   4602498675187552091     # double 0.48999999999999994 .LCPI0_2:     .quad   4599850558606658239     # double 0.34299999999999992 .LCPI0_3:     .quad   4597818534454788671     # double 0.24009999999999995 .LCPI0_4:     .quad   4595223380205512696     # double 0.16806999999999994 .LCPI0_5:     .quad   4593141924544133109     # double 0.11764899999999996 .LCPI0_6:     .quad   4590598673379842654     # double 0.082354299999999963 .LCPI0_7:     .quad   4588468774839143248     # double 0.057648009999999972 .LCPI0_8:     .quad   4585976388698138603     # double 0.040353606999999979 .LCPI0_9:     .quad   4583799016135705775     # double 0.028247524899999984 .LCPI0_10:     .quad   4581356477717521223     # double 0.019773267429999988 .LCPI0_11:     .quad   4579132580613789641     # double 0.01384128720099999 .LCPI0_12:     .quad   4576738892963968780     # double 0.0096889010406999918 .LCPI0_13:     .quad   4574469401809764420     # double 0.0067822307284899942 .LCPI0_14:     .quad   4572123587912939977     # double 0.0047475615099429958

and it unrolls the inner loop:

.LBB0_2:                                # %.preheader     faddl   .LCPI0_0(%rip)     faddl   .LCPI0_1(%rip)     faddl   .LCPI0_2(%rip)     faddl   .LCPI0_3(%rip)     faddl   .LCPI0_4(%rip)     faddl   .LCPI0_5(%rip)     faddl   .LCPI0_6(%rip)     faddl   .LCPI0_7(%rip)     faddl   .LCPI0_8(%rip)     faddl   .LCPI0_9(%rip)     faddl   .LCPI0_10(%rip)     faddl   .LCPI0_11(%rip)     faddl   .LCPI0_12(%rip)     faddl   .LCPI0_13(%rip)     faddl   .LCPI0_14(%rip)

Note, that it is using a builtin function(gcc documents theirs here) to calculate pow at compile time and if we use -fno-builtin it no longer performs this optimization.

If you change k to 1.0 then gcc is able to perform the same optimization:

.L3:     fadd    %st, %st(1) #,     addl    $1, %eax    #, t     cmpl    %eax, %edi  # t, num     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     fadd    %st, %st(1) #,     jne .L3 #,

Although it is a simpler case.

If you change the condition for the inner loop to n < 4 then gcc seems willing to optimize when k = 0.7. As indicated in the comments to the question, if the compiler does not believe unrolling will help then it will likely be conservative in how much unrolling it will do since there is a code size trade off.

As indicated in the comments I am using a modified version of the OP's code in the godbolt examples but it does not change the underlying conclusion.

Note as indicated in a comment above if we use -fno-math-errno, which stops errno from being set, gcc does apply a similar optimization.

Why does clang produce a much faster code than gcc for this simple function involving exponentiation?

Tags:

c++

performance

optimization

gcc

clang

romeric

1 Answers

Shafik Yaghmour

Recent Activity

Donate For Us

Why does clang produce a much faster code than gcc for this simple function involving exponentiation?

Tags:

c++

performance

optimization

gcc

clang

romeric

1 Answers

Shafik Yaghmour

Related questions

Recent Activity

Donate For Us