Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How come this both loop runs equally fast when compiled with -O3, but not when compiled with -O2?

In below program I expect test1 to run slower because of the dependent instructions. A test run with -O2 seemed to confirm this. But then I tried with -O3 and now the timings are more or less equal. How can this be?

#include <iostream>
#include <vector>
#include <cstring>
#include <chrono>

volatile int x = 0; // used for preventing certain optimizations


enum { size = 60 * 1000 * 1000 };
std::vector<unsigned> a(size + x); // `size + x` makes the vector size unknown by compiler 
std::vector<unsigned> b(size + x);


void test1()
{
    for (auto i = 1u; i != size; ++i)
    {
        a[i] = a[i] + a[i-1]; // data dependency hinders pipelining(?)
    }
}


void test2()
{
    for (auto i = 0u; i != size; ++i)
    {
        a[i] = a[i] + b[i]; // no data dependencies
    }
}


template<typename F>
int64_t benchmark(F&& f)
{
    auto start_time = std::chrono::high_resolution_clock::now();
    f();
    auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start_time);
    return elapsed_ms.count();
}


int main(int argc, char**)
{   
    // make sure the optimizer cannot make any assumptions
    // about the contents of the vectors:
    for (auto& el : a) el = x;
    for (auto& el : b) el = x;

    test1(); // warmup
    std::cout << "test1: " << benchmark(&test1) << '\n';

    test2(); // warmup        
    std::cout << "\ntest2: " << benchmark(&test2) << '\n';

    return a[x] * x; // prevent optimization and exit with code 0
}

I get these results:

g++-4.8 -std=c++11 -O2 main.cpp && ./a.out
test1: 115
test2: 48

g++-4.8 -std=c++11 -O3 main.cpp && ./a.out
test1: 29
test2: 38
like image 568
StackedCrooked Avatar asked Dec 24 '15 14:12

StackedCrooked


People also ask

Is O2 faster than O3?

If you are jumping around a large amount of code, -O3 generally performs more poorly than -O2, but if you are running a tight loop (like HPC code), -O3 is better.

Does gcc optimize by default?

GCC has a range of optimization levels, plus individual options to enable or disable particular optimizations. The overall compiler optimization level is controlled by the command line option -On, where n is the required optimization level, as follows: -O0 . (default).

What is O3 in gcc?

-O3 : the highest level of optimization possible. It enables optimizations that are expensive in terms of compile time and memory usage. Compiling with -O3 is not a guaranteed way to improve performance, and in fact, in many cases, can slow down a system due to larger binaries and increased memory usage.

Why should code optimization be controlled by a compile time flag?

Turning on optimization flags makes the compiler attempt to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program.


1 Answers

Because in -O3 gcc effectively eliminates the data dependency, by storing the value of a[i] in a register and reusing it on the next iteration instead of loading a[i-1].

The result is more or less equivalent to:

void test1()
{
    auto x = a[0];
    auto end = a.begin() + size;
    for (auto it = next(a.begin()); it != end; ++it)
    {
        auto y = *it; // Load
        x = y + x;
        *it = x; // Store
    }
}

Which compiled in -O2 yield the exact same assembly as your code compiled in -O3.

The second loop in your question is unrolled in -O3, hence the speedup. The two optimizations applied seem to be unrelated to me, the first case is faster simply because gcc removed a load instruction, the second because it is unrolled.

In both cases I don't think that the optimizer did anything in particular to improve the cache behavior, both memory access patterns are easy predictable by the cpu.

like image 152
sbabbi Avatar answered Nov 07 '22 17:11

sbabbi