 

Visual Studio C++ compiler generates 3x slower code when changing completely unrelated code

I have a nested for loop which generates the following assembly:

# branch target labels manually added for readability
002E20F8  mov         ebx,esi  
002E20FA  mov         dword ptr [ebp-10h],3B9ACA00h  
002E2101  sub         ebx,edi  
002E2103  add         ebx,7  
002E2106  shr         ebx,3  
002E2109  nop         dword ptr [eax]  
  outer_loop:
002E2110  xor         eax,eax  
002E2112  xor         ecx,ecx  
002E2114  cmp         edi,esi  
002E2116  mov         edx,ebx  
002E2118  cmova       edx,eax  
002E211B  mov         eax,edi  
002E211D  test        edx,edx 
002E211F  je          main+107h (02E2137h)  ;end_innerloop

  inner_loop:           
002E2121  movsd       xmm0,mmword ptr [eax] 
002E2125  inc         ecx                     ; inc/addsd swapped
002E2126  addsd       xmm0,mmword ptr [k]   
002E212B  add         eax,8  
002E212E  movsd       mmword ptr [k],xmm0  
002E2133  cmp         ecx,edx  
002E2135  jne         main+0F1h (02E2121h)  ;inner_loop
  end_innerloop:        
002E2137  sub         dword ptr [ebp-10h],1  
002E213B  jne         main+0E0h (02E2110h)   ;outer_loop

If I change a line of code before the nested for loop to simply declare an int and then print it out after the loop, the compiler pulls the store/reload of k out of the inner loop.

The first version of the question described this as "generate the instructions in a slightly different order". (editor's note: perhaps I should leave this analysis / correction for the answer?)

003520F8  mov         ebx,esi  
003520FA  mov         dword ptr [ebp-10h],3B9ACA00h  
00352101  sub         ebx,edi  
00352103  add         ebx,7  
00352106  shr         ebx,3  
00352109  nop         dword ptr [eax]  
  outer_loop:
00352110  xor         eax,eax  
00352112  xor         ecx,ecx  
00352114  cmp         edi,esi  
00352116  mov         edx,ebx  
00352118  cmova       edx,eax  
0035211B  mov         eax,edi  
0035211D  test        edx,edx  
0035211F  je          main+107h (0352137h) ;end_innerloop

00352121  movsd       xmm0,mmword ptr [k]    ; load of k hoisted out of the loop.  Strangely not optimized to xorpd xmm0,xmm0

  inner_loop:
00352126  addsd       xmm0,mmword ptr [eax]
0035212A  inc         ecx  
0035212B  add         eax,8  
0035212E  cmp         ecx,edx  
00352130  jne         main+0F6h (0352126h)  ;inner_loop

00352132  movsd       mmword ptr [k],xmm0     ; movsd in different place.

  end_innerloop:
00352137  sub         dword ptr [ebp-10h],1  
0035213B  jne         main+0E0h (0352110h)  ;outer_loop

This second arrangement by the compiler is 3x faster. I'm slightly shocked by this. Does anyone know what is going on?

This was compiled with Visual Studio 2015.

Compiler flags (I can add more if requested):

Optimization: Maximize Speed /O2

The code:

#include <iostream>
#include <vector>
#include "Stopwatch.h"

static constexpr int N = 1000000000;

int main()
{
    std::vector<double> buffer;

    buffer.resize(10);

    for (auto& i : buffer)
    {
        i = 1e-100;
    }

    double k = 0;
    int h = 0; // removing this line and swapping the lines std::cout << "time = "... results in 3x slower code??!!

    Stopwatch watch;

    for (int i = 0; i < N; i++)
    {
        for (auto& j : buffer)
        {
            k += j;
        }
    }

    //std::cout << "time = " << watch.ElapsedMilliseconds() << " / " << k << std::endl;
    std::cout << "time = " << watch.ElapsedMilliseconds() << " / " << k << " / " << h << std::endl;

    std::cout << "Done...";
    std::getchar();

    return EXIT_SUCCESS;
}

Stopwatch class:

#pragma once

#include <chrono>

class Stopwatch
{
private:
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::microseconds microseconds;
    typedef std::chrono::milliseconds milliseconds;

    clock::time_point _start;

public:
    Stopwatch()
    {
        Restart();
    }

    void Restart()
    {
        _start = clock::now();
    }

    double ElapsedMilliseconds()
    {
        return ElapsedMicroseconds() * 1E-3;
    }

    double ElapsedSeconds()
    {
        return ElapsedMicroseconds() * 1E-6;
    }

    Stopwatch(const Stopwatch&) = delete;
    Stopwatch& operator=(const Stopwatch&) = delete;

private:
    double ElapsedMicroseconds()
    {
        return static_cast<double>(std::chrono::duration_cast<microseconds>(clock::now() - _start).count());
    }
};
asked Aug 04 '16 by keith


1 Answer

After editing the question to fix your confusingly-placed line-breaks, and add branch-target labels in front of the addresses in the jcc instructions to figure out what the code was actually doing, it became clear that the loops were significantly different. The movsd isn't reordered within the loop; it's outside the loop.

Instead of leaving that stuff in the question and correcting it in the answer, I decided to edit the question and talk about it here. I figured the code blocks were long enough that future readers would just get bogged down trying to keep track of 4 versions of the code, and it wasn't something that would help people with the same question find it with a search engine.


The fast version keeps k in a register (xmm0), while the slow version reloads/stores it every iteration. This is often a sign that the compiler's alias analysis failed to prove that things couldn't overlap.

It's not the extra store and load themselves that hurt; it's that they lengthen the loop-carried dependency chain by the store-forwarding latency, from the store in one iteration to the reload in the next. Store-forwarding latency is something like 6 cycles on modern Intel CPUs, vs. 3 cycles for addsd (on Haswell, for example). That perfectly explains the factor-of-3 speedup:

  • 9 cycles per iteration when the loop-carried dependency chain is addsd + store-forwarding
  • 3 cycles per iteration when the loop-carried dependency chain is just addsd

See http://agner.org/optimize/ for instruction tables and microarchitecture details, and the other links in the x86 tag wiki.
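
To put that in source terms, here is a minimal sketch (my own, not code from the question or from the compiler) of the same reduction written with an obviously-local inner accumulator; keeping the running sum in a register instead of bouncing it through memory is exactly the difference between the two asm versions above.

#include <vector>

// Hypothetical rewrite of the question's loop (the function name and `sum`
// are my own): the inner accumulator is a plain local, so the compiler can
// keep it in an xmm register and the loop-carried dependency is one addsd
// (~3 cycles per iteration) instead of addsd + store-forwarding (~9 cycles).
double sum_n_times(const std::vector<double>& buffer, int n)
{
    double k = 0.0;
    for (int i = 0; i < n; i++)
    {
        double sum = 0.0;
        for (double j : buffer)
            sum += j;          // register-only dependency chain
        k += sum;              // fold the partial sum into k outside the hot loop
    }
    return k;
}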


IDK how MSVC didn't manage to prove that k doesn't overlap with anything, because it's a local whose address doesn't escape the function. (Its address isn't even taken). MSVC does a horrible job there. It should also just xorps xmm0,xmm0 to zero it ahead of the loop, instead of loading some zeroed memory. I don't even see where it zeros any memory; I guess that's not the asm for the whole function.

If you'd compiled with MSVC's equivalent of -ffast-math (/fp:fast), it could have vectorized the reduction (with addpd), and hopefully used multiple accumulators. Although with such a tiny vector that you're looping over many times, the non-multiple-of-4 element count is moderately inconvenient. Still, loop overhead isn't the issue here; the loop-carried dependency chain dominates even when k is kept in a register, since your code only uses one accumulator. One addsd per 3 clocks leaves tons of time for other insns to run.
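
For what "multiple accumulators" looks like in source (again my own illustration, not MSVC output), two independent running sums let consecutive add latencies overlap instead of serializing. Note that this reassociates FP addition, which is exactly why the compiler needs a fast-math-style option to do it on its own:

#include <cstddef>
#include <vector>

// Illustrative only: acc0 and acc1 form two independent dependency chains,
// so the CPU can overlap their addsd latencies. A tail step handles
// odd-sized buffers.
double sum_two_accumulators(const std::vector<double>& buffer)
{
    double acc0 = 0.0, acc1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < buffer.size(); i += 2)
    {
        acc0 += buffer[i];
        acc1 += buffer[i + 1];
    }
    if (i < buffer.size())
        acc0 += buffer[i];     // leftover element when size() is odd
    return acc0 + acc1;
}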

Ideally, allowing associative FP math reordering would get the compiler to optimize it to k = N * std::accumulate(...); like @Ped7g suggested, treating the sum over the array as a common subexpression.
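
In source terms, that transformation would look something like the following (a sketch of the idea, not anything MSVC actually emits; the result can differ from the original only by FP rounding order):

#include <numeric>
#include <vector>

// One pass over the buffer plus one multiply replaces the N * size()
// additions, which is only legal if the compiler (or you) may reassociate
// floating-point math.
double fast_equivalent(const std::vector<double>& buffer, int n)
{
    return n * std::accumulate(buffer.begin(), buffer.end(), 0.0);
}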


BTW, there are much better ways to initialize a vector:

Instead of resizing a vector (which value-initializes the new double elements to zero) and then writing new values, you should just do something like

std::vector<double> buffer(10, 1e-100);   // 10 elements set to 1e-100

That makes sure the asm doesn't waste time storing zeros before storing the value you want. I think resize can also take a value to copy into new elements, so you can still declare an empty vector and then resize it.
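
That resize overload does exist in standard C++; as an illustration:

#include <vector>

int main()
{
    std::vector<double> buffer;
    buffer.resize(10, 1e-100);   // new elements are copies of 1e-100 instead of zeros
}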

answered by Peter Cordes