In his talk "Efficiency with algorithms, Performance with data structures", Chandler Carruth talks about the need for a better allocator model in C++. The current allocator model invades the type system and, because of that, makes it almost impossible to use in many projects. On the other hand, the Bloomberg allocator model does not invade the type system, but it is based on virtual function calls, which makes it impossible for the compiler to 'see' the allocation and optimize it away. In his talk, he mentions compilers deduplicating memory allocations (1:06:47).
It took me some time to find examples of this memory-allocation optimization, but I have found the following code sample, which, compiled under Clang, optimizes away all the memory allocations and just returns 1000000 without allocating anything:
template<typename T>
T* create() { return new T(); }

int main() {
    auto result = 0;
    for (auto i = 0; i < 1000000; ++i) {
        result += (create<int>() != nullptr);
    }
    return result;
}
The following paper, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3664.html, also says that allocations could be fused by compilers, and it seems to suggest that some compilers already do that sort of thing.
As I am very interested in strategies for efficient memory allocation, I really want to understand why Chandler Carruth is against virtual calls in the Bloomberg model. The example above clearly shows that Clang optimizes allocations away when it can see them.
I have found this amazing example which answers the first point of the initial question. Points 2 and 3 do not have any answer yet.
#include <iostream>
#include <vector>
#include <chrono>

std::vector<double> f_val(std::size_t i, std::size_t n) {
    auto v = std::vector<double>( n );
    for (std::size_t k = 0; k < v.size(); ++k) {
        v[k] = static_cast<double>(k + i);
    }
    return v;
}

void f_ref(std::size_t i, std::vector<double>& v) {
    for (std::size_t k = 0; k < v.size(); ++k) {
        v[k] = static_cast<double>(k + i);
    }
}

int main (int argc, char const *argv[]) {
    const auto n = std::size_t{10};
    const auto nb_loops = std::size_t{300000000};

    // Begin: Zone 1
    {
        auto v = std::vector<double>( n, 0.0 );
        auto start_time = std::chrono::high_resolution_clock::now();
        for (std::size_t i = 0; i < nb_loops; ++i) {
            auto w = f_val(i, n);
            for (std::size_t k = 0; k < v.size(); ++k) {
                v[k] += w[k];
            }
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
        std::cout << time << std::endl;
        std::cout << v[0] << " " << v[n - 1] << std::endl;
    }
    // End: Zone 1

    {
        auto v = std::vector<double>( n, 0.0 );
        auto w = std::vector<double>( n );
        auto start_time = std::chrono::high_resolution_clock::now();
        for (std::size_t i = 0; i < nb_loops; ++i) {
            f_ref(i, w);
            for (std::size_t k = 0; k < v.size(); ++k) {
                v[k] += w[k];
            }
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
        std::cout << time << std::endl;
        std::cout << v[0] << " " << v[n - 1] << std::endl;
    }
    return 0;
}
Here, not a single memory allocation happens in the for loop with f_val. This only happens with Clang, though (GCC and icpc both fail here), and in a slightly more complicated example the optimization is not performed.