The following code shows a big performance difference between the two versions of max_3
on my machine (Windows 7, VC++ 2015, release build).
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

template <typename X>
const X& max_3_left( const X& a, const X& b, const X& c )
{
    return std::max( std::max( a, b ), c );
}

template <typename X>
const X& max_3_right( const X& a, const X& b, const X& c )
{
    return std::max( a, std::max( b, c ) );
}
int main()
{
    std::random_device r;
    std::default_random_engine e1( r() );
    std::uniform_int_distribution<int> uniform_dist( 1, 6 );
    std::vector<int> numbers;
    for ( int i = 0; i < 1000; ++i )
        numbers.push_back( uniform_dist( e1 ) );

    auto start1 = std::chrono::high_resolution_clock::now();
    int sum1 = 0;
    for ( int i = 0; i < 1000; ++i )
        for ( int j = 0; j < 1000; ++j )
            for ( int k = 0; k < 1000; ++k )
                sum1 += max_3_left( numbers[i], numbers[j], numbers[k] );
    auto finish1 = std::chrono::high_resolution_clock::now();
    std::cout << "left " << sum1 << " "
              << std::chrono::duration_cast<std::chrono::microseconds>( finish1 - start1 ).count()
              << " us" << std::endl;

    auto start2 = std::chrono::high_resolution_clock::now();
    int sum2 = 0;
    for ( int i = 0; i < 1000; ++i )
        for ( int j = 0; j < 1000; ++j )
            for ( int k = 0; k < 1000; ++k )
                sum2 += max_3_right( numbers[i], numbers[j], numbers[k] );
    auto finish2 = std::chrono::high_resolution_clock::now();
    std::cout << "right " << sum2 << " "
              << std::chrono::duration_cast<std::chrono::microseconds>( finish2 - start2 ).count()
              << " us" << std::endl;
}
Output:
left 739861041 796056 us
right 739861041 1442495 us
On ideone the difference is smaller but still not negligible.
Why does this difference exist?
gcc and clang (and presumably MSVC) fail to realize that max is an associative operation like addition: v[i] max (v[j] max v[k]) (max_3_right) is the same as (v[i] max v[j]) max v[k] (max_3_left). I'm writing max as an infix operator to point out the similarity with + and other associative operations.
Since v[k] is the only input that changes inside the inner loop, it's obviously a big win to hoist the (v[i] max v[j]) out of the inner loop.
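In other words, a compiler that treated max as associative could transform the max_3_right loop into something like this (a hand-written sketch of the optimization, not actual compiler output; v stands for the numbers vector from the question):

// What gcc/clang actually do for max_3_left, and what they *could*
// do for max_3_right if they recognized max as associative:
int sum = 0;
for ( int i = 0; i < 1000; ++i )
    for ( int j = 0; j < 1000; ++j ) {
        int ij = std::max( v[i], v[j] );  // loop-invariant: hoisted out of the k loop
        for ( int k = 0; k < 1000; ++k )
            sum += std::max( ij, v[k] );  // one max (one cmov) per iteration
    }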
To see what's actually going on, we as always have to look at the asm. To make it easy to find the asm for the loops, I split them out into separate functions. (Making one template function with the max3 function as a parameter would be more C++-like; a sketch of that follows the listing below.) This has the added advantage of taking the code we want optimized out of main, which gcc marks as "cold", disabling some optimizations.
#include <algorithm>
#include <vector>
#define SIZE 1000

// max_3_right as defined in the question
int sum_maxright( const std::vector<int> &v )
{
    int sum = 0;
    for ( int i = 0; i < SIZE; ++i )
        for ( int j = 0; j < SIZE; ++j )
            for ( int k = 0; k < SIZE; ++k )
                sum += max_3_right( v[i], v[j], v[k] );
    return sum;
}
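For reference, the more C++-like template version mentioned above could look something like this (my sketch, not from the actual test; sum_max3 is a name I made up):

// Sketch: one benchmark body, parameterized on the 3-way max.
template <typename F>
int sum_max3( const std::vector<int> &v, F max3 )
{
    int sum = 0;
    for ( int i = 0; i < SIZE; ++i )
        for ( int j = 0; j < SIZE; ++j )
            for ( int k = 0; k < SIZE; ++k )
                sum += max3( v[i], v[j], v[k] );
    return sum;
}
// usage:
// int left  = sum_max3( v, []( int a, int b, int c ){ return std::max( std::max( a, b ), c ); } );
// int right = sum_max3( v, []( int a, int b, int c ){ return std::max( a, std::max( b, c ) ); } );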
The innermost loop of that compiles to the following (gcc 5.3 targeting the x86-64 Linux ABI with -std=gnu++11 -fverbose-asm -O3 -fno-tree-vectorize -fno-unroll-loops -march=haswell, plus some hand annotations):
## from outer loops: rdx points to v[k] (starting at v.begin()). r8 is v.end(). (r10 is v.begin)
## edi is v[i], esi is v[j]
## eax is sum
## inner loop. See the full asm on godbolt.org, link below
.L10:
cmp DWORD PTR [rdx], esi # MEM[base: _65, offset: 0], D.92793
mov ecx, esi # D.92793, D.92793
cmovge ecx, DWORD PTR [rdx] # ecx = max(v[j], v[k])
cmp ecx, edi # D.92793, D.92793
cmovl ecx, edi # ecx = max(ecx, v[i])
add rdx, 4 # pointer increment
add eax, ecx # sum, D.92793
cmp rdx, r8 # ivtmp.253, D.92795
jne .L10 #,
Clang 3.8 makes similar code for the max_3_right loop, with two cmov instructions inside the inner loop. (Use the compiler dropdown in the Godbolt Compiler Explorer to see.)
gcc and clang both optimize the way you'd expect for the max_3_left loop, hoisting everything but a single cmov out of the inner loop:
## register allocation is slightly different here:
## esi = max(v[i], v[j]). rdi = v.end()
.L2:
cmp DWORD PTR [rdx], ecx # MEM[base: _65, offset: 0], D.92761
mov esi, ecx # D.92761, D.92761
cmovge esi, DWORD PTR [rdx] # MEM[base: _65, offset: 0],, D.92761
add rdx, 4 # ivtmp.226,
add eax, esi # sum, D.92761
cmp rdx, rdi # ivtmp.226, D.92762
jne .L2 #,
So there's much less going on in this loop. (On Intel pre-Broadwell, cmov is a 2-uop instruction, so one fewer cmov is a big deal.)
BTW, cache-prefetching effects can't possibly explain this: the inner loop accesses numbers[k] sequentially. The repeated accesses to numbers[i] and numbers[j] are hoisted out of the inner loop by any decent compiler, and wouldn't confuse modern prefetchers even if they weren't.
Intel's optimization manual says that up to 32 streams of prefetch patterns can be detected and maintained (with a limit of one forward and one backward stream per 4k page) for Sandybridge-family microarchitectures (section 2.3.5.4, Data Prefetching).
The OP didn't say what hardware this microbenchmark ran on, but since real compilers hoist the other loads, leaving only the most trivial access pattern, it hardly matters.
One vector of 1000 ints (4B each) only takes ~4kiB (1000 × 4B = 4000B). This means the whole array easily fits in L1D cache (32kiB on Intel SnB-family), so there's no need for any kind of prefetching in the first place. It all stays hot in L1 cache pretty much the entire time.
As molbdnilo pointed out, the problem could be with the order of the loops. When calculating sum1, the code can be rewritten as:
for ( int i = 0; i < 1000; ++i )
    for ( int j = 0; j < 1000; ++j ) {
        auto temp = std::max( numbers[i], numbers[j] );
        for ( int k = 0; k < 1000; ++k )
            sum1 += std::max( temp, numbers[k] );
    }
The same cannot be applied to the calculation of sum2. However, when I reordered the loops of the second calculation as:
for ( int j = 0; j < 1000; ++j )
    for ( int k = 0; k < 1000; ++k )
        for ( int i = 0; i < 1000; ++i )
            sum2 += ...;
I got the same times for both calculations. (Moreover, both calculations are much faster with -O3 than with -O2; the former seemingly turns on vectorization, judging from the disassembled output.)