 

Adding a print statement speeds up code by an order of magnitude

As the title suggests, I've hit some extremely bizarre performance behavior in a piece of C/C++ code, and I have no idea how to explain it.

Here's an as-close-as-I've-found-to-minimal working example [EDIT: see below for a shorter one]:

#include <stdio.h>
#include <stdlib.h>
#include <complex>

using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
  cdbl ff[pp], gg[pp];
  for(int ii = 0; ii < pp; ii++) {
    ff[ii] = gg[ii] = 1.0;
  }

  for(int it = 0; it < 1000; it++) {
    cdbl dual[pp];

    for(int ii = 0; ii < pp; ii++) {
      dual[ii] = 0.0;
    }

    for(int h1 = 0; h1 < pp; h1 ++) {
      for(int h2 = 0; h2 < pp; h2 ++) {
        cdbl avg_right = 0.0;
        for(int xx = 0; xx < pp; xx ++) {
          int c00 = xx, c01 = (xx + h1) % pp, c10 = (xx + h2) % pp, 
              c11 = (xx + h1 + h2) % pp;
          avg_right += ff[c00] * conj(ff[c01]) * conj(ff[c10]) * gg[c11];
        }
        avg_right /= static_cast<cdbl>(pp);

        for(int xx = 0; xx < pp; xx ++) {
          int c01 = (xx + h1) % pp, c10 = (xx + h2) % pp, 
              c11 = (xx + h1 + h2) % pp;
          dual[xx] += conj(ff[c01]) * conj(ff[c10]) * ff[c11] * conj(avg_right);
        }
      }
    }
    for(int ii = 0; ii < pp; ii++) {
      dual[ii] = conj(dual[ii]) / static_cast<double>(pp*pp);
    }

    for(int ii = 0; ii < pp; ii++) {
      gg[ii] = dual[ii];
    }

#ifdef I_WANT_THIS_TO_RUN_REALLY_FAST
    printf("%.15lf\n", gg[0].real());
#else // I_WANT_THIS_TO_RUN_REALLY_SLOWLY
#endif

  }
  printf("%.15lf\n", gg[0].real());

  return 0;
}

Here are the results of running this on my system:

me@mine $ g++ -o test.elf test.cc -Wall -Wextra -O2
me@mine $ time ./test.elf > /dev/null
    real    0m7.329s
    user    0m7.328s
    sys     0m0.000s
me@mine $ g++ -o test.elf test.cc -Wall -Wextra -O2 -DI_WANT_THIS_TO_RUN_REALLY_FAST
me@mine $ time ./test.elf > /dev/null
    real    0m0.492s
    user    0m0.490s
    sys     0m0.001s
me@mine $ g++ --version
g++ (Gentoo 4.9.4 p1.0, pie-0.6.4) 4.9.4 [snip]

It's not terribly important what this code computes: it's just a tonne of complex arithmetic on arrays of length 29. It's been "simplified" from a much larger tonne of complex arithmetic that I care about.

So, the behavior seems to be, as claimed in the title: if I put this print statement back in, the code gets a lot faster.

I've played around a bit: e.g., printing a constant string doesn't give the speedup, but printing the clock time does. There's a pretty clear threshold: the code is either fast or slow.
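
For concreteness, the variants I tried in place of the printf at the end of the outer loop look something like this (a sketch, not my exact lines):

// Drop-in replacements for the printf in the #ifdef branch above;
// clock() needs <time.h>.
printf("hello\n");                // constant string: no speedup
printf("%ld\n", (long)clock());   // printing the clock time: fast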

I considered the possibility that some bizarre compiler optimization either does or doesn't kick in, maybe depending on whether the code does or doesn't have side effects. But if so, it's pretty subtle: when I looked at the disassembled binaries, they seemed identical except that one has the extra print call and they use different (but interchangeable) registers. I may (must?) have missed something important.

I'm at a total loss to explain what on earth could be causing this. Worse, it actually affects my life, because I'm running related code a lot, and going round inserting extra print statements does not feel like a good solution.

Any plausible theories would be very welcome. Responses along the lines of "your computer's broken" are acceptable if you can explain how that might explain anything.


UPDATE: with apologies for the increasing length of the question, I've shrunk the example to

#include <stdio.h>
#include <stdlib.h>
#include <complex>

using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
  cdbl ff[pp];
  cdbl blah = 0.0;
  for(int ii = 0; ii < pp; ii++) {
    ff[ii] = 1.0;
  }

  for(int it = 0; it < 1000; it++) {
    cdbl xx = 0.0;

    for(int kk = 0; kk < 100; kk++) {
      for(int ii = 0; ii < pp; ii++) {
        for(int jj = 0; jj < pp; jj++) {
          xx += conj(ff[ii]) * conj(ff[jj]) * ff[ii];
        }
      }
    }
    blah += xx;

    printf("%.15lf\n", blah.real());
  }
  printf("%.15lf\n", blah.real());

  return 0;
}

I could make it even smaller, but the machine code is already manageable. If I change the five bytes of the binary corresponding to the callq instruction for that first printf to 0x90 (NOP), execution goes from fast to slow.
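
For the curious, here's a sketch of how such a patch can be made. The file offset below is made up; in practice you read it off the objdump -d output for your own binary:

#include <cstdio>

int main() {
  const long CALL_OFFSET = 0x1234;  // hypothetical offset of the 5-byte callq
  unsigned char nops[5] = {0x90, 0x90, 0x90, 0x90, 0x90};
  FILE *f = fopen("test.elf", "r+b");
  if (!f) return 1;
  fseek(f, CALL_OFFSET, SEEK_SET);  // seek to the call instruction
  fwrite(nops, 1, sizeof nops, f);  // overwrite it with NOPs
  fclose(f);
  return 0;
}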

The compiled code is very heavy with function calls to __muldc3(). I suspect it must have something to do with how well the Broadwell architecture handles these jumps: both versions run the same number of instructions, so the difference is in instructions per cycle (about 0.16 vs. about 2.8).
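
For context, __muldc3 is the libgcc helper that implements complex double multiplication. Stripped of its C99 Annex G recovery code for infinite/NaN operands, each call computes roughly the following (a sketch, not libgcc's actual source):

#include <complex>

// (a + bi) * (c + di) = (ac - bd) + (ad + bc)i
std::complex<double> muldc3_sketch(double a, double b, double c, double d) {
  return std::complex<double>(a * c - b * d, a * d + b * c);
}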

Also, compiling -static makes things fast again.


Further shameless update: I'm conscious I'm the only one who can play with this, so here are some more observations:

It seems like calling any library function (including some foolish ones I made up that do nothing) for the first time puts execution into the slow state. A subsequent call to printf, fprintf or sprintf somehow clears the state, and execution is fast again. So, importantly, the first time __muldc3() is called we go into the slow state, and the next {,f,s}printf call resets everything.

Once a library function has been called once and the state has been reset, that function becomes free: you can use it as much as you like without changing the state.

So, e.g.:

#include <stdio.h>
#include <stdlib.h>
#include <complex>

using namespace std;

int main() {
  complex<double> foo = 0.0;
  foo += foo * foo; // 1
  char str[10];
  sprintf(str, "%c\n", 'c');
  //fflush(stdout); // 2

  for(int it = 0; it < 100000000; it++) {
    foo += foo * foo;
  }

  return (foo.real() > 10.0);
}

is fast, but commenting out the line marked // 1, or uncommenting the line marked // 2, makes it slow again.

It must be relevant that the first time a library call is run, the "trampoline" in the PLT is initialized to point into the shared library. So maybe this dynamic-loading code somehow leaves the processor frontend in a bad place until it's "rescued".
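
To illustrate the kind of do-nothing library function experiment mentioned above, here's a sketch (the names and build lines are made up):

// nothing.cc -- build with: g++ -shared -fPIC -o libnothing.so nothing.cc
extern "C" void do_nothing() { }

// main.cc -- build with: g++ -O2 main.cc -L. -lnothing -o test.elf
//            run with:   LD_LIBRARY_PATH=. ./test.elf
#include <complex>
extern "C" void do_nothing();

int main() {
  std::complex<double> foo = 0.0;
  do_nothing();  // first call through the PLT trampoline: enters the slow state
  for (int it = 0; it < 100000000; it++) {
    foo += foo * foo;  // the __muldc3 calls in here now run slowly
  }
  return (foo.real() > 10.0);
}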

asked Feb 21 '17 by Freddie Manners


1 Answer

For the record, I finally figured this out.

It turns out this is to do with AVX–SSE transition penalties. To quote this exposition from Intel:

When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.

The compiled version of my main loops above includes legacy SSE instructions (movapd and friends, I think), whereas the implementation of __muldc3 in libgcc_s uses a lot of fancy AVX instructions (vmovapd, vmulsd etc.).

This is the ultimate cause of the slowdown. Indeed, Intel performance diagnostics show that this AVX/SSE switching happens almost exactly once each way per call of __muldc3 (in the last version of the code posted above):

$ perf stat -e cpu/event=0xc1,umask=0x08/ -e cpu/event=0xc1,umask=0x10/ ./slow.elf

 Performance counter stats for './slow.elf':
    100,000,064      cpu/event=0xc1,umask=0x08/
    100,000,118      cpu/event=0xc1,umask=0x10/

(event codes taken from table 19.5 of another Intel manual).

That leaves the question of why the slowdown turns on when you call a library function for the first time, and turns off again when you call printf, sprintf or whatever. The clue is in the first document again:

When it is not possible to remove the transitions, it is often possible to avoid the penalty by explicitly zeroing the upper 128-bits of the YMM registers, in which case the hardware does not save these values.

I think the full story is therefore as follows. When you call a library function for the first time, the trampoline code in ld-linux-x86-64.so that sets up the PLT leaves the upper bits of the YMM registers in a non-zero state. When you call sprintf, among other things it zeros out the upper bits of the YMM registers (whether by chance or by design, I'm not sure).

Replacing the sprintf call with asm("vzeroupper") — which instructs the processor explicitly to zero these high bits — has the same effect.
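
Concretely, a variant of the last test program above along these lines (a sketch) stays fast with no {,f,s}printf call at all:

#include <complex>

using namespace std;

int main() {
  complex<double> foo = 0.0;
  foo += foo * foo;  // first __muldc3 call dirties the upper YMM halves
  asm("vzeroupper");  // zero them explicitly, in place of the sprintf

  for(int it = 0; it < 100000000; it++) {
    foo += foo * foo;
  }

  return (foo.real() > 10.0);
}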

The effect can be eliminated by adding -mavx or -march=native to the compile flags, which is how the rest of the system was built. Why this doesn't happen by default is just a mystery of my system, I guess.

I'm not quite sure what we learn here, but there it is.

answered Nov 15 '22 by Freddie Manners