
C++ code execution time varies with small source change that shouldn't introduce any extra work

While working on benchmarking some code, I found that its execution time would vary with even the most innocuous code changes.

I have attempted to boil down the code below to the most minimal test case, but it is still rather lengthy (for which I apologize). Changing virtually anything largely affects the benchmark results.

#include <string>
#include <vector>
#include <iostream>
#include <random>
#include <chrono>
#include <functional>
#include <cstdint>   // int64_t
#include <cstdlib>   // EXIT_SUCCESS

constexpr double usec_to_sec = 1000000.0;

// Simple convenience timer
class Timer
{
    std::chrono::high_resolution_clock::time_point start_time;
public:
    Timer() : start_time(std::chrono::high_resolution_clock::now()) { }
    int64_t operator()() const {
        return static_cast<int64_t>(
        std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::high_resolution_clock::now()-start_time).count()
        );
    }
};

// Convenience random number generator
template <typename T>
class RandGen
{
    mutable std::default_random_engine generator;
    std::uniform_int_distribution<T> distribution;

    constexpr unsigned make_seed() const {
        return static_cast<unsigned>(std::chrono::system_clock::now().time_since_epoch().count());
    }
public:
    RandGen(T min, T max) : generator(make_seed()), distribution(min, max) { }
    T operator ()() { return distribution(generator); }
};

// Printer class
class Printer
{
    std::string filename;
    template <class S>    
    friend Printer &operator<<(Printer &, S &&s);
public:
    Printer(const char *filename) : filename(filename) {}
};

template <class S>
Printer &operator<<(Printer &pm, S &&s) {
    std::cout << s;
    return pm;
}

// +------------+
// | Main Stuff |
// +------------+
void runtest(size_t run_length)
{
    static RandGen<size_t> word_sz_generator(10, 20);
    static RandGen<int> rand_char_generator(0, 25);

    size_t total_char_count = 0;
    std::vector<std::string> word_list;
    word_list.reserve(run_length);

    Printer printer("benchmark.dat");
    printer << "Running test... ";

    Timer timer; // start timer
    for (auto i = 0; i < run_length; i++) {

        size_t word_sz = word_sz_generator();
        std::string word;
        for (auto sz = 0; sz < word_sz; sz++) {
            word.push_back(static_cast<char>(rand_char_generator())+'a');
        }
        word_list.emplace_back(std::move(word));
        total_char_count += word_sz;
    }
    int64_t execution_time_usec = timer(); // stop timer

    printer << /*run_length*/ word_list.size() << " words, and " 
            << total_char_count << " total characters, were built in "
            << execution_time_usec/usec_to_sec << " seconds.\n";
}

int main(int argc, char **argv)
{
    constexpr size_t iterations = 30;
    constexpr size_t run_length = 50000000;

    for (auto i = 0; i < iterations; i++)
        runtest(run_length);

    return EXIT_SUCCESS;
}

The 1st class, Timer, is just a small convenience class (intentionally not well-featured, for brevity) for timing the code.

I tried to do without the 2nd class RandGen (which just generates random values), but any attempt to exclude this from the test code made the problem auto-magically disappear. So, I suspect the issue has something to do with it. But I can't figure out how.

The 3rd class Printer seems entirely unnecessary for this question, but again, including it seems to exacerbate the issue.

So, now we're down to main() (which just runs the test) and runtest().

runtest() is hideous, so please don't look at it from a "clean code" point of view. Changing it in any way (e.g. moving the inner for loop to its own function) changes the benchmark results. The simplest, and most perplexing, example is the last line:

printer << /*run_length*/ word_list.size() << " words, and " 
        << total_char_count << " total characters, were built in "
        << execution_time_usec/usec_to_sec << " seconds.\n";

In the line above, run_length and word_list.size() are the same: the size of vector word_list is determined by run_length. But if I run the code as-is, I get an average execution time of 9.8 seconds, whereas if I uncomment run_length and comment out word_list.size(), the execution time actually increases to an average of 10.6 seconds. I can't fathom how such an insignificant code change could affect the timing of the whole program to such an extent.

In other words...

9.8 seconds:

printer << /*run_length*/ word_list.size() << " words, and " 
        << total_char_count << " total characters, were built in "
        << execution_time_usec/usec_to_sec << " seconds.\n";

10.6 seconds:

printer << run_length /*word_list.size()*/ << " words, and " 
        << total_char_count << " total characters, were built in "
        << execution_time_usec/usec_to_sec << " seconds.\n";

I have repeated the exercise of commenting and uncommenting the variables noted above, and re-running the benchmarks, many times. The results are repeatable and consistent: 9.8 seconds and 10.6 seconds, respectively.

The code output looks like this, for the two cases:

Running test... 50000000 words, and 750000798 total characters, were built in 9.83379 seconds.
Running test... 50000000 words, and 749978210 total characters, were built in 9.84541 seconds.
Running test... 50000000 words, and 749996688 total characters, were built in 9.87418 seconds.
Running test... 50000000 words, and 749995415 total characters, were built in 9.85704 seconds.
Running test... 50000000 words, and 750017699 total characters, were built in 9.86186 seconds.
Running test... 50000000 words, and 749998680 total characters, were built in 9.83395 seconds.
...

Running test... 50000000 words, and 749988517 total characters, were built in 10.604 seconds.
Running test... 50000000 words, and 749958011 total characters, were built in 10.6283 seconds.
Running test... 50000000 words, and 749994387 total characters, were built in 10.6374 seconds.
Running test... 50000000 words, and 749995242 total characters, were built in 10.6445 seconds.
Running test... 50000000 words, and 749988379 total characters, were built in 10.6543 seconds.
Running test... 50000000 words, and 749969532 total characters, were built in 10.6722 seconds.
...

[Plot (EZL Software): measured execution times for the two versions]

Any information on what would cause this discrepancy would be greatly appreciated.

Notes:

  1. Even removing the unused std::string filename member from the Printer class yields different benchmark results; doing so eliminates (or reduces to insignificant proportions) the difference between the two benchmarks shown above.
  2. This does not seem to be an issue when compiling with g++ (on Ubuntu). Although I can't say this definitively: my tests on Ubuntu were in a VM on the same Windows machine, and the VM may not have had access to all of the host's resources and processor features.
  3. I am using Visual Studio Community 2017 (version 15.7.4)
    • Compiler version: 19.14.26431
    • All tests and reported results are Release Build, 64-bit
  4. System: Win10, i7-6700K @ 4.00 GHz, 32 GB RAM
Blair Fonville asked Aug 03 '18


2 Answers

You're probably running into some kind of code-alignment effect. Modern x86-64 CPUs are fairly robust with respect to alignment most of the time, but alignment can affect branches aliasing each other in the branch predictors (like @rcgldr mentioned), and various front-end effects.

See https://agner.org/optimize/, and performance links in the x86 tag wiki. But honestly I don't think there's any useful explanation here, other than that you've discovered your loop is sensitive to alignment effects, either from the front-end or from branch prediction. This means that even identical machine code at different alignments in your main program could have different performance.

This is a known phenomenon. An answer on Code alignment in one object file is affecting the performance of a function in another object file has some general comments about how alignment can matter, and see also Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly? There's an article somewhere about how linking object files in a different order can affect performance (and that this is an unexpected effect from the toolchain), but I couldn't find it.

You can use HW performance counters to measure branch misprediction rates to see if that explains why one version is slower than the other. Or if there's some other front-end effect.

But unfortunately there's not much you can do; trivial source differences, if they affect the asm at all, will change alignment for everything.

You can sometimes redesign things to be less sensitive to branch-prediction by replacing branches with branchless code. e.g. always generate 16 bytes of random letters, and truncate that to a random length. (Some branching on size when copying it is probably inevitable, unless creating a 16-byte std::string and then truncating it can be branchless.)

You might speed that up with SIMD, e.g. a vectorized PRNG such as an SSE2 or AVX2 xorshift+ to generate 16 bytes of random letters at a time. (Efficiently getting a uniform 0..25 distribution with packed-byte operations may be tricky, but maybe the same technique as the 0..9 distribution I used to generate 1 GiB of space-separated random ASCII digits in ~0.03 seconds on a 3.9 GHz Skylake would be useful. It's not perfectly uniformly distributed, because 65536 % 10 has a remainder (and likewise for the 0..25 range), but you can probably adjust the quality-vs-speed tradeoff and still run fast.)


Comparing compiler output from the two versions

The asm for both versions of the inner loop in the runtest function are essentially identical, at least if the compiler asm output we see on the Godbolt compiler explorer matches what you're actually getting in the executable from MSVC. (Unlike with gcc/clang, its asm output can't necessarily be assembled into a working object file.) If your real release build does any link-time optimization that could inline some library code, it might make different optimization choices in the final executable.

I put in a #ifdef so I could use -DUSE_RL to have two MSVC 2017 outputs that built the same source different ways, and feed those asm outputs to a diff pane. (The diff pane is at the bottom in the messy layout that I linked; click the fullscreen box on it to show just that.)

The only differences in the whole function are:

  • ordering and register choice for a few instructions like mov edx, DWORD PTR _tls_index and mov QWORD PTR run_length$GSCopy$1$[rbp-121], rcx at the top of the function, which only run once (and don't differ in code size, so they won't affect alignment later). This should have no effect on later code: they make the same changes to architectural state, just using a different scratch register that is not used again.
  • stack layout (position of local variables relative to RBP). But all the offsets are under +127, so they can all still use a [rbp + disp8] addressing mode.
  • Different code-gen from the actual source difference:

          mov     rdx, QWORD PTR word_list$[rbp-113]
          sub     rdx, QWORD PTR word_list$[rbp-121]  ; word_list.size() = end - start 
          ...
          sar     rdx, 5               ; >> 5   arithmetic right shift
    

    vs.

          mov     rdx, rsi             ; copy run_length from another register
    

    And no, these instructions alone can't possibly explain the speed difference. They're only run once per timing interval, before some I/O.

  • An extra npad 7 for alignment before a branch target near the bottom of the function (after a call _Xtime_get_ticks), after the above code difference.

There's a big block of red/green differences, but those are only from different numbering of labels, except for those three instructions at the start of the function.

But before runtest, the word_list.size() version includes code for a ??$?6_K@@YAAEAVPrinter@@AEAV0@$QEA_K@Z PROC function which doesn't appear anywhere for the version using run_length. (C++ name-mangling turns types into funky characters in the asm names of functions.) This is doing something for class Printer.

You said removing the unused std::string filename from Printer removed the code-gen difference. Well that function probably goes away with that change. IDK why MSVC decided to emit it at all, let alone only in one version vs. another.

Probably g++ -O3 doesn't have that code-gen difference, and that's why you don't see a difference. (Assuming your VM is hardware virtualization, g++-generated machine code is still running natively on the CPU. Getting a new page of memory from the OS might take a tiny bit longer in the VM, but the main time spent in the loop is probably in user-space in this code.)


BTW, gcc warns

<source>:72:24: warning: comparison of integer expressions of different signedness: 'int' and 'size_t' {aka 'long unsigned int'} [-Wsign-compare]

     for (auto i = 0; i < run_length; i++) {
                      ~~^~~~~~~~~~~~

I didn't look closely at the asm output to see if that led to worse code-gen with gcc or MSVC, or if it's just going to be unsafe if you pass large inputs.

Peter Cordes answered Sep 21 '22


I ran into a similar situation: minor changes to code were having significant effects on run time. After converting the code to assembly so that I could control where it was placed in memory, I found a significant difference on an Intel 3770K 3.5 GHz processor, depending on where calls and tight loops were located. The largest difference I found was 36.5%, mentioned in a question I posted that was originally about indexed branching into fall-through code versus a tight loop. Even stranger, it depended on a combination of locations, as noted in the comments in the assembly code (perhaps a conflict in the instruction cache?): the loop version's time ranged from 1.465 seconds to 2.000 seconds for identical code, with the only difference being nops between functions used to align the code to specific boundaries.

Indexed branch overhead on X86 64 bit mode

I don't know if other processors would experience this much difference in performance due to location of code.

rcgldr answered Sep 21 '22