 

Memory allocation optimized away by compilers

In his talk "Efficiency with Algorithms, Performance with Data Structures", Chandler Carruth talks about the need for a better allocator model in C++. The current allocator model invades the type system, which makes it almost impossible to use in many projects. The Bloomberg allocator model, on the other hand, does not invade the type system, but it is based on virtual function calls, which makes it impossible for the compiler to 'see' the allocation and optimize it away. In his talk (1:06:47) he mentions compilers deduplicating memory allocations.

It took me some time to find examples of memory allocation optimization, but I have found this code sample which, compiled under clang, optimizes away all the memory allocations and just returns 1000000 without allocating anything.

template<typename T>
T* create() { return new T(); }

int main() {
    auto result = 0;
    for (auto i = 0; i < 1000000; ++i) {
        result += (create<int>() != nullptr);
    }

    return result;
}

The following paper, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3664.html, also says that allocations could be fused by compilers, and seems to suggest that some compilers already do that sort of thing.

As I am very interested in strategies for efficient memory allocation, I really want to understand why Chandler Carruth is against the virtual calls in the Bloomberg model. The example above clearly shows that clang can optimize allocations away when it can see them.

  1. I would like to have a "real life code" example where this optimization is useful and done by a current compiler.
  2. Do you have any example of code where different allocations are fused by any current compiler?
  3. Do you understand what Chandler Carruth means when he says that compilers can "deduplicate" your allocations in his talk at 1:06:47?
InsideLoop asked May 02 '15 22:05

1 Answer

I have found this example, which answers the first point of the initial question. Points 2 and 3 don't have any answer yet.

#include <iostream>
#include <vector>
#include <chrono>

std::vector<double> f_val(std::size_t i, std::size_t n) {
    auto v = std::vector<double>( n );
    for (std::size_t k = 0; k < v.size(); ++k) {
        v[k] = static_cast<double>(k + i);
    }
    return v;
}

void f_ref(std::size_t i, std::vector<double>& v) {
    for (std::size_t k = 0; k < v.size(); ++k) {
        v[k] = static_cast<double>(k + i);
    }
}

int main (int argc, char const *argv[]) {
    const auto n = std::size_t{10};
    const auto nb_loops = std::size_t{300000000};

    // Begin: Zone 1
    {
        auto v = std::vector<double>( n, 0.0 );
        auto start_time = std::chrono::high_resolution_clock::now();
        for (std::size_t i = 0; i < nb_loops; ++i) {
            auto w = f_val(i, n);
            for (std::size_t k = 0; k < v.size(); ++k) {
                v[k] += w[k];
            }
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
        std::cout << time << std::endl;
        std::cout << v[0] << " " << v[n - 1] << std::endl;
    }
    // End: Zone 1

    {
        auto v = std::vector<double>( n, 0.0 );
        auto w = std::vector<double>( n );
        auto start_time = std::chrono::high_resolution_clock::now();
        for (std::size_t i = 0; i < nb_loops; ++i) {
            f_ref(i, w);
            for (std::size_t k = 0; k < v.size(); ++k) {
                v[k] += w[k];
            }
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
        std::cout << time << std::endl;
        std::cout << v[0] << " " << v[n - 1] << std::endl;
    }

    return 0;
}

where not a single memory allocation happens in the loop that calls f_val. This only happens with clang though (gcc and icpc both fail to do it), and on a slightly more complicated example, the optimization is not done.

InsideLoop answered Oct 15 '22 19:10