I have a nested for loop which generates the following assembly:
# branch target labels manually added for readability
002E20F8 mov ebx,esi
002E20FA mov dword ptr [ebp-10h],3B9ACA00h
002E2101 sub ebx,edi
002E2103 add ebx,7
002E2106 shr ebx,3
002E2109 nop dword ptr [eax]
outer_loop:
002E2110 xor eax,eax
002E2112 xor ecx,ecx
002E2114 cmp edi,esi
002E2116 mov edx,ebx
002E2118 cmova edx,eax
002E211B mov eax,edi
002E211D test edx,edx
002E211F je main+107h (02E2137h) ;end_innerloop
inner_loop:
002E2121 movsd xmm0,mmword ptr [eax]
002E2125 inc ecx ; inc/addsd swapped
002E2126 addsd xmm0,mmword ptr [k]
002E212B add eax,8
002E212E movsd mmword ptr [k],xmm0
002E2133 cmp ecx,edx
002E2135 jne main+0F1h (02E2121h) ;inner_loop
end_innerloop:
002E2137 sub dword ptr [ebp-10h],1
002E213B jne main+0E0h (02E2110h) ;outer_loop
If I change a line of code before the nested for loop to simply declare an int and then print it out after the loop, the compiler pulls the store/reload of k out of the inner loop.
The first version of the question described this as "generate the instructions in a slightly different order".
003520F8 mov ebx,esi
003520FA mov dword ptr [ebp-10h],3B9ACA00h
00352101 sub ebx,edi
00352103 add ebx,7
00352106 shr ebx,3
00352109 nop dword ptr [eax]
outer_loop:
00352110 xor eax,eax
00352112 xor ecx,ecx
00352114 cmp edi,esi
00352116 mov edx,ebx
00352118 cmova edx,eax
0035211B mov eax,edi
0035211D test edx,edx
0035211F je main+107h (0352137h) ;end_innerloop
00352121 movsd xmm0,mmword ptr [k] ; load of k hoisted out of the loop. Strangely not optimized to xorpd xmm0,xmm0
inner_loop:
00352126 addsd xmm0,mmword ptr [eax]
0035212A inc ecx
0035212B add eax,8
0035212E cmp ecx,edx
00352130 jne main+0F6h (0352126h) ;inner_loop
00352132 movsd mmword ptr [k],xmm0 ; store of k sunk out of the inner loop
end_innerloop:
00352137 sub dword ptr [ebp-10h],1
0035213B jne main+0E0h (0352110h) ;outer_loop
This second arrangement by the compiler is 3x faster. I'm slightly shocked by this. Does anyone know what is going on?
This was compiled with Visual Studio 2015.
Compiler flags (I can add more if requested):
Optimization: Maximize Speed /O2
The code:
#include <iostream>
#include <vector>
#include <cstdio>   // std::getchar
#include <cstdlib>  // EXIT_SUCCESS
#include "Stopwatch.h"

static constexpr int N = 1000000000;

int main()
{
    std::vector<double> buffer;
    buffer.resize(10);
    for (auto& i : buffer)
    {
        i = 1e-100;
    }
    double k = 0;
    int h = 0; // removing this line and using the commented-out std::cout below instead results in 3x slower code??!!
    Stopwatch watch;
    for (int i = 0; i < N; i++)
    {
        for (auto& j : buffer)
        {
            k += j;
        }
    }
    //std::cout << "time = " << watch.ElapsedMilliseconds() << " / " << k << std::endl;
    std::cout << "time = " << watch.ElapsedMilliseconds() << " / " << k << " / " << h << std::endl;
    std::cout << "Done...";
    std::getchar();
    return EXIT_SUCCESS;
}
Stopwatch class:
#pragma once
#include <chrono>

class Stopwatch
{
private:
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::microseconds microseconds;
    typedef std::chrono::milliseconds milliseconds;

    clock::time_point _start;

public:
    Stopwatch()
    {
        Restart();
    }

    void Restart()
    {
        _start = clock::now();
    }

    double ElapsedMilliseconds()
    {
        return ElapsedMicroseconds() * 1E-3;
    }

    double ElapsedSeconds()
    {
        return ElapsedMicroseconds() * 1E-6;
    }

    Stopwatch(const Stopwatch&) = delete;
    Stopwatch& operator=(const Stopwatch&) = delete;

private:
    double ElapsedMicroseconds()
    {
        return static_cast<double>(std::chrono::duration_cast<microseconds>(clock::now() - _start).count());
    }
};
After editing the question to fix your confusingly placed line breaks and to add branch-target labels in front of the addresses in the jcc instructions (so I could figure out what the code was actually doing), it became clear that the loops were significantly different. The movsd isn't reordered within the loop; it's outside the loop.

Instead of leaving that stuff in the question and correcting it in the answer, I decided to edit the question and discuss it here. The code blocks are long enough that future readers would just get bogged down trying to keep track of four versions of the code, and it isn't something that would help people with the same question find it with a search engine.
The fast version keeps k in a register (xmm0), while the slow version reloads/stores it every iteration. This is often a sign that the compiler's alias analysis failed to prove that things couldn't overlap.
It's not the extra stores and loads themselves that hurt, it's the fact that they lengthen the loop-carried dependency chain by the store-forwarding latency, from the store in one iteration to the load in the next iteration. Store-forwarding latency is something like 6 cycles on modern Intel CPUs, vs. 3 cycles for addsd (on Haswell, for example). So that perfectly explains the factor-of-3 speedup: (addsd + store-forwarding) / addsd = (3 + 6) / 3 = 3.
See http://agner.org/optimize/ for instruction tables and microarchitecture details, plus the other links in the x86 tag wiki.
IDK how MSVC didn't manage to prove that k doesn't overlap with anything, because it's a local whose address doesn't escape the function. (Its address isn't even taken.) MSVC does a horrible job there. It should also just xorps xmm0,xmm0 to zero it ahead of the loop, instead of loading some zeroed memory. I don't even see where it zeros any memory; I guess that's not the asm for the whole function.
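If you don't want to rely on the compiler getting this right, you can make the register accumulation explicit in the source. Here's a sketch of a standard idiom (mine, not part of the original question): sum into a block-local temporary whose address trivially never escapes, and commit to k once per outer iteration. Note the rounding order differs slightly from accumulating element-by-element into k.

double k = 0;
for (int i = 0; i < N; i++)
{
    double sum = 0.0; // block-local accumulator: can't alias anything, stays in xmm0
    for (const auto& j : buffer)
    {
        sum += j;     // no store/reload of k inside the hot loop
    }
    k += sum;         // one update of k per outer iteration
}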
If you'd compiled with MSVC's equivalent of -ffast-math, it could have vectorized the reduction (with addpd), hopefully with multiple accumulators. With such a tiny vector that you're looping over many times, the non-multiple-of-4 element count is moderately inconvenient. Still, loop overhead isn't the issue here: the loop-carried dependency chain dominates even when k is kept in a register, since your code only uses one accumulator, and one addsd per 3 clocks leaves tons of time for other instructions to run.
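For illustration, here's the inner reduction hand-rolled with two accumulators (my sketch, not MSVC output). It reassociates the FP sum, which is exactly the permission fast-math grants, and the two independent loop-carried chains let the CPU overlap addsd latency:

double s0 = 0.0, s1 = 0.0;           // two independent dependency chains
std::size_t idx = 0;
for (; idx + 1 < buffer.size(); idx += 2)
{
    s0 += buffer[idx];               // chain 1
    s1 += buffer[idx + 1];           // chain 2 runs during chain 1's addsd latency
}
for (; idx < buffer.size(); ++idx)
{
    s0 += buffer[idx];               // remainder for odd element counts
}
k += s0 + s1;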
Ideally, allowing associative FP math reordering would get the compiler to optimize it to k = N * std::accumulate(...); like @Ped7g suggested, treating the sum over the array as a common subexpression.
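In source form, that would be something like the following sketch. It changes FP rounding relative to the nested loop, which is exactly why the compiler needs fast-math-style permission to do it on its own:

#include <numeric>

// Sum the buffer once, then scale by the trip count.
// Only equivalent to the nested loop under associative-FP assumptions.
double k = N * std::accumulate(buffer.begin(), buffer.end(), 0.0);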
BTW, there are much better ways to initialize a vector: instead of resizing it (value-initializing the new elements, which zeros them for double) and then writing new values, you should just do something like

std::vector<double> buffer(10, 1e-100); // 10 elements set to 1e-100

That makes sure the asm doesn't waste time storing zeros before storing the value you actually want. resize can also take a value to copy into new elements, so you can still declare an empty vector and then resize it.
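A minimal sketch of that resize overload (std::vector::resize(count, value) is standard C++):

std::vector<double> buffer;  // start empty
buffer.resize(10, 1e-100);   // new elements are copies of 1e-100; no separate zero-fill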