In the program below I expect test1 to run slower because of its dependent instructions. A test run with -O2 seemed to confirm this. But then I tried with -O3, and now the timings are more or less equal. How can this be?
#include <iostream>
#include <vector>
#include <cstring>
#include <chrono>
volatile int x = 0; // used for preventing certain optimizations
enum { size = 60 * 1000 * 1000 };
std::vector<unsigned> a(size + x); // `size + x` makes the vector size unknown by compiler
std::vector<unsigned> b(size + x);
void test1()
{
    for (auto i = 1u; i != size; ++i)
    {
        a[i] = a[i] + a[i-1]; // data dependency hinders pipelining(?)
    }
}
void test2()
{
    for (auto i = 0u; i != size; ++i)
    {
        a[i] = a[i] + b[i]; // no data dependencies
    }
}
template<typename F>
int64_t benchmark(F&& f)
{
    auto start_time = std::chrono::high_resolution_clock::now();
    f();
    auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start_time);
    return elapsed_ms.count();
}
int main(int argc, char**)
{
    // make sure the optimizer cannot make any assumptions
    // about the contents of the vectors:
    for (auto& el : a) el = x;
    for (auto& el : b) el = x;
    test1(); // warmup
    std::cout << "test1: " << benchmark(&test1) << '\n';
    test2(); // warmup
    std::cout << "\ntest2: " << benchmark(&test2) << '\n';
    return a[x] * x; // prevent optimization and exit with code 0
}
I get these results:
g++-4.8 -std=c++11 -O2 main.cpp && ./a.out
test1: 115
test2: 48
g++-4.8 -std=c++11 -O3 main.cpp && ./a.out
test1: 29
test2: 38
If your program jumps around a large amount of code, -O3 generally performs worse than -O2, because the larger code it generates puts more pressure on the instruction cache; but if you are running a tight loop (like HPC code), -O3 is usually better.
GCC has a range of optimization levels, plus individual options to enable or disable particular optimizations. The overall optimization level is controlled by the command-line option -On, where n is the desired level, ranging from -O0 (the default, no optimization) up to -O3.
-O3 is the highest optimization level. It enables optimizations that are expensive in terms of compile time and memory usage. Compiling with -O3 is not a guaranteed way to improve performance; in many cases it can even slow a program down due to larger binaries and increased memory usage.
Turning on optimization flags makes the compiler attempt to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program.
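If you want to see exactly which extra optimization passes -O3 enables compared to -O2 on your compiler, gcc can list them itself; a quick way (the exact output varies by gcc version) is:
g++-4.8 -Q --help=optimizers -O2 > o2.txt
g++-4.8 -Q --help=optimizers -O3 > o3.txt
diff o2.txt o3.txt   # on g++-4.8, -ftree-vectorize is among the options that only -O3 turns on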
Because at -O3 gcc effectively eliminates the data dependency, by storing the value of a[i] in a register and reusing it in the next iteration instead of loading a[i-1] from memory.
The result is more or less equivalent to:
void test1()
{
    auto x = a[0];
    auto end = a.begin() + size;
    for (auto it = std::next(a.begin()); it != end; ++it)
    {
        auto y = *it; // Load
        x = y + x;
        *it = x;      // Store
    }
}
which, compiled at -O2, yields the exact same assembly as your code compiled at -O3.
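One way to check this yourself is to emit the assembly for both versions and compare the inner loops (the file name rewritten.cpp below is just a placeholder for wherever you put the rewritten test1):
g++-4.8 -std=c++11 -O3 -S -o original_O3.s main.cpp
g++-4.8 -std=c++11 -O2 -S -o rewritten_O2.s rewritten.cpp
Then diff (or just eyeball) the loop bodies of test1 in original_O3.s and rewritten_O2.s.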
The second loop in your question is unrolled at -O3, hence the speedup. The two optimizations seem unrelated to me: the first case is faster simply because gcc removed a load instruction; the second is faster because the loop is unrolled. A sketch of the unrolling is shown below.
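To make the unrolling concrete, here is roughly what the transformed test2 could look like. The function name and the unroll factor of four are my own assumptions for illustration; the factor gcc actually picks, and whether it additionally vectorizes the loop with SIMD instructions, depends on the compiler version and target.
void test2_unrolled_sketch() // hypothetical name, not what gcc emits
{
    auto i = 0u;
    // main loop: four independent additions per iteration,
    // which the CPU can overlap instead of executing one at a time
    for (; i + 4 <= size; i += 4)
    {
        a[i]     = a[i]     + b[i];
        a[i + 1] = a[i + 1] + b[i + 1];
        a[i + 2] = a[i + 2] + b[i + 2];
        a[i + 3] = a[i + 3] + b[i + 3];
    }
    // remainder loop for any leftover elements; with size = 60 million
    // it does nothing, but it is shown here for generality
    for (; i != size; ++i)
    {
        a[i] = a[i] + b[i];
    }
}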
In both cases I don't think the optimizer did anything in particular to improve the cache behavior; both memory access patterns are easily predictable by the CPU.