Explanation wanted: log10 faster than log and log2, but only with O2 and greater [closed]

Question

I need to use a logarithm function in some of my code, but the base doesn't matter. So I set out to pick between log(), log2(), and log10() by performance, provided I found any significant differences. (I will refer to said functions as ln, lb, and lg respectively).

Why am I fussing about this? Because I will be calling the function as often as 400,000,000 times per iteration of an optimisation algorithm. This is neither optional nor the topic of my question.

I set up some really basic tests, like so:

timespec start, end;
double sum = 0, m;

clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int n = 1; n < INT_MAX; ++n)
{
    m = n * 10.1;
    sum += log(m);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

cout << "ln=";
cout << diff(start, end).tv_sec << ":" << diff(start, end).tv_nsec << endl;

... // likewise for log2 and log10

(timespec diff(timespec start, timespec end) if you so desire....)

The following results were obtained:

GCC v4.6.3

-O0
ln=140:516853107
lb=155:878100147
lg=173:534086352

-O1
ln=133:948317112
lb=144:78885393
lg=163:870021712

-O2
ln=9:108117039
lb=9:134447209
lg=4:87951676

-O3
ln=9:102016996
lb=9:204672042
lg=4:153153558

I've looked at the output of compiling with -S, but I really don't have a good enough grip on assembler to fully understand the differences. -S output: -O0 -S, -O3 -S

Why does lg optimise better with O2/O3?

EDIT: Source code, note the typo in the third loop, this is the cause of log10 seeming faster (mult. get optimised out). I have accepted the answer I believe is closest, since the question has now been closed, although I learnt a lot from drhirsch's and janneb's answers.

janneb · Accepted Answer

This is going to depend on the implementation of the log() functions in the C library, compiler version, hardware architecture, and so on. Anyway, below I'm using GCC 4.4 on x86-64 with glibc 2.11.

Changing the example so that I add a line

cout << "sum=" << sum << endl;

which prevents the compiler from optimizing away the log() calls, as I mentioned in a comment, I get the following timings (whole seconds only, -O2):

log: 98s
log2: 105s
log10: 120s

These timings seem roughly consistent with the -O0 and -O1 timings in the original post; at higher optimization levels the log evaluations are optimized away, hence the -O2 and -O3 results are so different.

Furthermore, looking at the log example with the "perf" profiler, the top 5 offenders in the report are


# Samples: 3259205
#
# Overhead         Command              Shared Object  Symbol
# ........  ..............  .........................  ......
#
    87.96%             log  /lib/libm-2.11.1.so        [.] __ieee754_log
     5.51%             log  /lib/libm-2.11.1.so        [.] __log
     2.88%             log  ./log                      [.] main
     2.84%             log  /lib/libm-2.11.1.so        [.] __isnan
     0.69%             log  ./log                      [.] log@plt

Except for main, all the other symbols are related to the log() call. Summing up these, we can conclude that 97% of the total runtime of this example is spent in log().

The implementation of __ieee754_log can be found here in the glibc git repo. Correspondingly, the other implementations are: log2, log10. Note that the previous links are to the HEAD versions, for released version see their corresponding branches

Gunther Piez · Answer

Unfortunately the OP failed to show us the original code, he chose to obfuscate the code slightly converting it to assembly.

In the assembly code the OP linked (annotations by me):

.L10:
    cvtsi2sd %ebx, %xmm0         // convert i to double
    xorpd    %xmm1, %xmm1        // zero 
    mulsd    .LC0(%rip), %xmm0   // multiply i with 10.1
    ucomisd   %xmm0, %xmm1       // compare with zero
    jae       .L31               // always below, never jump

    addl    $1, %ebx             // i++
    cmpl    $2147483647, %ebx    // end of for loop
    jne     .L10
    ...
.L31:
    call    log10, log2, whatever...  // this point is never reached

One can see that the call to log is never executed, especially if you step through it with gdb. All the code does are 2³¹ multiplications and comparisons of a double.

This also explains the stunning increase of execution speed of the log function by a factor of 30 when compiled with -O2, in case anybody found this strange too.

Edit:

for (int n = 1; n < INT_MAX; ++n)
{
    m = n * 10.1;
    sum += log(m);
}

The compiler isn't able to completely optimize the loops away, because she is not able to prove that the call to log will always succeed - it has side effects, if the argument is negative. So she replaces the loop by a comparison with zero - the log is only executed, if the result of the multiplication is less or below zero. Which means it is never executed :-)

What stays in the loop is the multiplication and the test if the result might be negative.

An interesting result happens if I add -ffast-math to the compiler options, which relieves the compiler from the strict IEEE compliance:

ln=0:000000944
lb=0:000000475
lg=0:000000357

Explanation wanted: log10 faster than log and log2, but only with O2 and greater [closed]

Tags:

c++

performance

math

gcc

Iskar Jarak

2 Answers

janneb

Gunther Piez

Recent Activity

Donate For Us

Explanation wanted: log10 faster than log and log2, but only with O2 and greater [closed]

Tags:

c++

performance

math

gcc

Iskar Jarak

2 Answers

janneb

Gunther Piez

Related questions

Recent Activity

Donate For Us