
The program runs 3 times slower when compiled with g++ 5.3.1 than when compiled with g++ 4.8.4 using the same command

Recently, I started to use Ubuntu 16.04 with g++ 5.3.1 and noticed that my program runs 3 times slower. Before that I used Ubuntu 14.04 with g++ 4.8.4. I built it with the same flags in both cases: CFLAGS = -std=c++11 -Wall -O3.

My program contains loops filled with math calls (sin, cos, exp). You can find it here.

I've tried compiling with different optimization flags (-O0, -O1, -O2, -O3, -Ofast), but the problem is reproduced in all cases (with -Ofast both variants run faster, but the first still runs 3 times slower).

In my program I use libtinyxml-dev and libgslcblas, but they have the same versions in both cases and do not account for any significant part of the runtime (according to the code and to callgrind profiling).

I've performed profiling, but it doesn't give me any idea why this happens. Kcachegrind comparison (left is slower). I've only noticed that the program now uses libm-2.23, compared to libm-2.19 on Ubuntu 14.04.

My processor is an i7-5820 (Haswell).

I have no idea why it becomes slower. Do you have any ideas?

P.S. Below you can find the most time-consuming function:

void InclinedSum::prepare3D()
{
double buf1, buf2;
double sum_prev1 = 0.0, sum_prev2 = 0.0;
int break_idx1, break_idx2; 
int arr_idx;

for(int seg_idx = 0; seg_idx < props->K; seg_idx++)
{
    const Point& r = well->segs[seg_idx].r_bhp;

    for(int k = 0; k < props->K; k++)
    {
        arr_idx = seg_idx * props->K + k;
        F[arr_idx] = 0.0;

        break_idx2 = 0;

        for(int m = 1; m <= props->M; m++)
        {
            break_idx1 = 0;

            for(int l = 1; l <= props->L; l++)
            {
                buf1 = ((cos(M_PI * (double)(m) * well->segs[k].r1.x / props->sizes.x - M_PI * (double)(l) * well->segs[k].r1.z / props->sizes.z) -
                            cos(M_PI * (double)(m) * well->segs[k].r2.x / props->sizes.x - M_PI * (double)(l) * well->segs[k].r2.z / props->sizes.z)) /
                        ( M_PI * (double)(m) * tan(props->alpha) / props->sizes.x + M_PI * (double)(l) / props->sizes.z ) + 
                            (cos(M_PI * (double)(m) * well->segs[k].r1.x / props->sizes.x + M_PI * (double)(l) * well->segs[k].r1.z / props->sizes.z) -
                            cos(M_PI * (double)(m) * well->segs[k].r2.x / props->sizes.x + M_PI * (double)(l) * well->segs[k].r2.z / props->sizes.z)) /
                        ( M_PI * (double)(m) * tan(props->alpha) / props->sizes.x - M_PI * (double)(l) / props->sizes.z )
                            ) / 2.0;

                buf2 = sqrt((double)(m) * (double)(m) / props->sizes.x / props->sizes.x + (double)(l) * (double)(l) / props->sizes.z / props->sizes.z);

                for(int i = -props->I; i <= props->I; i++)
                {   

                    F[arr_idx] += buf1 / well->segs[k].length / buf2 *
                        ( exp(-M_PI * buf2 * fabs(r.y - props->r1.y + 2.0 * (double)(i) * props->sizes.y)) - 
                        exp(-M_PI * buf2 * fabs(r.y + props->r1.y + 2.0 * (double)(i) * props->sizes.y)) ) *
                        sin(M_PI * (double)(m) * r.x / props->sizes.x) * 
                        cos(M_PI * (double)(l) * r.z / props->sizes.z);
                }

                if( fabs(F[arr_idx] - sum_prev1) > F[arr_idx] * EQUALITY_TOLERANCE )
                {
                    sum_prev1 = F[arr_idx];
                    break_idx1 = 0;
                } else
                    break_idx1++;

                if(break_idx1 > 1)
                {
                    //std::cout << "l=" << l << std::endl;
                    break;
                }
            }

            if( fabs(F[arr_idx] - sum_prev2) > F[arr_idx] * EQUALITY_TOLERANCE )
            {
                sum_prev2 = F[arr_idx];
                break_idx2 = 0;
            } else
                break_idx2++;

            if(break_idx2 > 1)
            {
                std::cout << "m=" << m << std::endl;
                break;
            }
        }
    }
}
}

Further investigation. I wrote the following simple program:

#include <cmath>
#include <iostream>
#include <chrono>

#define CYCLE_NUM 1E+7
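// Note: 1E+7 is a double literal, so the loop condition below compares an
// int against a double; the loop runs CYCLE_NUM - 1 = 9,999,999 times.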

using namespace std;
using namespace std::chrono;

int main()
{
    double sum = 0.0;

    auto t1 = high_resolution_clock::now();
    for(int i = 1; i < CYCLE_NUM; i++)
    {
        sum += sin((double)(i)) / (double)(i);
    }
    auto t2 = high_resolution_clock::now();

    microseconds::rep t = duration_cast<microseconds>(t2-t1).count();

    cout << "sum = " << sum << endl;
    cout << "time = " << (double)(t) / 1.E+6 << endl;

    return 0;
}

I am really wondering why this simple sample program runs 2.5 times faster under g++ 4.8.4 with libc-2.19 (libm-2.19) than under g++ 5.3.1 with libc-2.23 (libm-2.23).

The compile command was:

g++ -std=c++11 -O3 main.cpp -o sum

Using other optimization flags doesn't change the ratio.

How can I figure out which of the two, gcc or libc, slows down the program?
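
One thing I'm considering (a rough, untested sketch of my own) is to keep the compiled code fixed and load sin at run time from whichever libm I want to test, so that only the library differs between runs:

// Sketch: time sin() fetched at run time from a specific libm, so the
// compiler-generated code stays identical and only the library changes.
// Pass the path to a libm.so.6 on the command line (e.g. the 2.19 or 2.23 one).
#include <chrono>
#include <dlfcn.h>
#include <iostream>

int main(int argc, char** argv)
{
    const char* path = (argc > 1) ? argv[1] : "libm.so.6";
    void* handle = dlopen(path, RTLD_NOW);
    if (!handle) { std::cerr << dlerror() << std::endl; return 1; }

    // sin is obtained through dlsym, so the compiler cannot substitute its
    // own builtin; every call goes into the library that was loaded above.
    typedef double (*sin_fn)(double);
    sin_fn my_sin = reinterpret_cast<sin_fn>(dlsym(handle, "sin"));
    if (!my_sin) { std::cerr << dlerror() << std::endl; return 1; }

    double sum = 0.0;
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 1; i < 10000000; i++)
        sum += my_sin(static_cast<double>(i)) / static_cast<double>(i);
    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << "sum = " << sum << std::endl;
    std::cout << "time = "
              << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() / 1.0e6
              << std::endl;

    dlclose(handle);
    return 0;
}

Built once with g++ -std=c++11 -O3 isolate.cpp -o isolate -ldl (the file name is just an example), this could be pointed at different libm builds. The obvious caveat is that a libm from a different glibc version may refuse to load or misbehave against the libc already in the process, so it only cleanly compares libraries compatible with the running libc.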

Asked Jul 03 '16 by Alex Novikov

1 Answer

For a really precise answer, you'll probably need a libm maintainer to look at your question. However, here is my take; treat it as a draft, and if I find something else I'll add it to this answer.

First, let's look at the asm generated by GCC with gcc 4.8.2 and gcc 5.3. There are only 4 differences:

  • at the beginning a xorpd gets transformed into a pxor, for the same registers
  • a pxor xmm1, xmm1 was added before the conversion from int to double (cvtsi2sd)
  • a movsd was moved just before the conversion
  • the addition (addsd) was moved just before a comparison (ucomisd)

All of this is probably not sufficient to explain the decrease in performance. Using a finer profiler (Intel's, for example) could allow a more conclusive analysis, but I don't have access to one.

Now, there is a dependency on sin, so let's see what changed there. The first problem is identifying which platform implementation you actually use... There are 17 different subfolders in glibc's sysdeps (where sin is defined), so I went for the x86_64 one.

First, the way processor capabilities are handled changed: for example, glibc/sysdeps/x86_64/fpu/multiarch/s_sin.c used to do the checking for FMA / AVX itself in 2.19, but in 2.23 it is done externally. There could be a bug in which the capabilities are not properly reported, resulting in FMA or AVX not being used. However, I don't find this hypothesis very plausible.
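
If you want to rule that out on your machine anyway, a quick check I would do (my own sketch, nothing provided by glibc) is to ask GCC's run-time CPU detection what it sees:

// Sketch: print what GCC's run-time CPU detection reports for the features
// relevant here. The exact set of names __builtin_cpu_supports accepts
// depends on the GCC version; a nonzero result means "reported as present".
#include <iostream>

int main()
{
    __builtin_cpu_init();  // harmless to call explicitly before the queries
    std::cout << "avx:  " << __builtin_cpu_supports("avx")  << std::endl;
    std::cout << "avx2: " << __builtin_cpu_supports("avx2") << std::endl;
    std::cout << "fma:  " << __builtin_cpu_supports("fma")  << std::endl;
    return 0;
}

This only tells you what GCC's detection sees, not what libm's own dispatch decided, but if AVX and FMA show up here, a capability-reporting bug becomes even less likely.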

Secondly, in .../x86_64/fpu/s_sinf.S, the only modifications (apart from a copyright update) change the stack offset, aligning it to 16 bytes; the same goes for sincos. I'm not sure that would make a huge difference.

However, 2.23 added a lot of sources for vectorized versions of the math functions, and some use AVX512, which your processor probably doesn't support because it is really new. Maybe libm tries to use such extensions and, since you don't have them, falls back on a generic version?

EDIT: I tried compiling it with gcc 4.8.5, but for that I needed to recompile glibc-2.19. For the moment I cannot link, because of this:

/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/libm.a(s_sin.o): in function « __cos »:
(.text+0x3542): undefined reference to « _dl_x86_cpu_features »
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/libm.a(s_sin.o): in function « __sin »:
(.text+0x3572): undefined reference to « _dl_x86_cpu_features »

I will try to resolve this, but note beforehand that this symbol is very probably responsible for choosing the right optimized version based on the processor, which may be part of the performance hit.

Answered Oct 02 '22 by Synxis