In the program below I expect test1 to run slower because of its dependent instructions. A test run with -O2 seemed to confirm this. But then I tried with -O3, and now the timings are more or less equal. How can this be?
#include <iostream>
#include <vector>
#include <cstring>
#include <chrono>
volatile int x = 0; // used for preventing certain optimizations
enum { size = 60 * 1000 * 1000 };
std::vector<unsigned> a(size + x); // `size + x` makes the vector size unknown by compiler
std::vector<unsigned> b(size + x);
void test1()
{
    for (auto i = 1u; i != size; ++i)
    {
        a[i] = a[i] + a[i-1]; // data dependency hinders pipelining(?)
    }
}
void test2()
{
    for (auto i = 0u; i != size; ++i)
    {
        a[i] = a[i] + b[i]; // no data dependencies
    }
}
template<typename F>
int64_t benchmark(F&& f)
{
    auto start_time = std::chrono::high_resolution_clock::now();
    f();
    auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start_time);
    return elapsed_ms.count();
}
int main(int argc, char**)
{
    // make sure the optimizer cannot make any assumptions
    // about the contents of the vectors:
    for (auto& el : a) el = x;
    for (auto& el : b) el = x;
    test1(); // warmup
    std::cout << "test1: " << benchmark(&test1) << '\n';
    test2(); // warmup
    std::cout << "\ntest2: " << benchmark(&test2) << '\n';
    return a[x] * x; // prevent optimization and exit with code 0
}
I get these results:
g++-4.8 -std=c++11 -O2 main.cpp && ./a.out
test1: 115
test2: 48
g++-4.8 -std=c++11 -O3 main.cpp && ./a.out
test1: 29
test2: 38
If your program jumps around a large amount of code, -O3 generally performs worse than -O2, because the larger code it generates puts more pressure on the instruction cache; but if you are running a tight loop (like HPC code), -O3 is usually better.
GCC has a range of optimization levels, plus individual options to enable or disable particular optimizations. The overall optimization level is controlled by the command-line option -On, where n is the desired level, ranging from -O0 (the default, no optimization) up to -O3.
-O3 is the highest optimization level. It enables optimizations that are expensive in terms of compile time and memory usage. Compiling with -O3 is not a guaranteed way to improve performance; in many cases it can even slow a program down due to larger binaries and increased memory usage.
Turning on optimization flags makes the compiler attempt to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program.
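If you want to see exactly which extra optimization passes -O3 enables compared to -O2 on your compiler, gcc can list them itself; a quick way (the exact output varies by gcc version) is:
g++-4.8 -Q --help=optimizers -O2 > o2.txt
g++-4.8 -Q --help=optimizers -O3 > o3.txt
diff o2.txt o3.txt   # on g++-4.8, -ftree-vectorize is among the options that only -O3 turns on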
Because at -O3 gcc effectively eliminates the data dependency, by storing the value of a[i] in a register and reusing it in the next iteration instead of loading a[i-1] from memory.
The result is more or less equivalent to:
void test1()
{
    auto x = a[0];
    auto end = a.begin() + size;
    for (auto it = std::next(a.begin()); it != end; ++it)
    {
        auto y = *it; // Load
        x = y + x;
        *it = x;      // Store
    }
}
which, compiled at -O2, yields the exact same assembly as your code compiled at -O3.
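One way to check this yourself is to emit the assembly for both versions and compare the inner loops (the file name rewritten.cpp below is just a placeholder for wherever you put the rewritten test1):
g++-4.8 -std=c++11 -O3 -S -o original_O3.s main.cpp
g++-4.8 -std=c++11 -O2 -S -o rewritten_O2.s rewritten.cpp
Then diff (or just eyeball) the loop bodies of test1 in original_O3.s and rewritten_O2.s.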
The second loop in your question is unrolled at -O3, hence the speedup. The two optimizations seem unrelated to me: the first case is faster simply because gcc removed a load instruction; the second is faster because the loop is unrolled. A sketch of the unrolling is shown below.
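To make the unrolling concrete, here is roughly what the transformed test2 could look like. The function name and the unroll factor of four are my own assumptions for illustration; the factor gcc actually picks, and whether it additionally vectorizes the loop with SIMD instructions, depends on the compiler version and target.
void test2_unrolled_sketch() // hypothetical name, not what gcc emits
{
    auto i = 0u;
    // main loop: four independent additions per iteration,
    // which the CPU can overlap instead of executing one at a time
    for (; i + 4 <= size; i += 4)
    {
        a[i]     = a[i]     + b[i];
        a[i + 1] = a[i + 1] + b[i + 1];
        a[i + 2] = a[i + 2] + b[i + 2];
        a[i + 3] = a[i + 3] + b[i + 3];
    }
    // remainder loop for any leftover elements; with size = 60 million
    // it does nothing, but it is shown here for generality
    for (; i != size; ++i)
    {
        a[i] = a[i] + b[i];
    }
}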
In both cases I don't think the optimizer did anything in particular to improve the cache behavior; both memory access patterns are easily predictable by the CPU.