Clang performance drop for specific C++ random number generation

Question

Using C++11's random module, I encountered an odd performance drop when using std::mt19937 (32 and 64bit versions) in combination with a uniform_real_distribution (float or double, doesn't matter). Compared to a g++ compile, it's more than an order of magnitude slower!

The culprit isn't just the mt generator, as it's fast with a uniform_int_distribution. And it isn't a general flaw in the uniform_real_distribution since that's fast with other generators like default_random_engine. Just that specific combination is oddly slow.

I'm not very familiar with the intrinsics, but the Mersenne Twister algorithm is more or less strictly defined, so a difference in implementation couldn't account for this difference I guess? measure Program is following, but here are my results for clang 3.4 and gcc 4.8.1 on a 64bit linux machine:

gcc 4.8.1 runtime_int_default: 185.6 runtime_int_mt: 179.198 runtime_int_mt_64: 175.195 runtime_float_default: 45.375 runtime_float_mt: 58.144 runtime_float_mt_64: 94.188  clang 3.4 runtime_int_default: 215.096 runtime_int_mt: 201.064 runtime_int_mt_64: 199.836 runtime_float_default: 55.143 runtime_float_mt: 744.072  <--- this and runtime_float_mt_64: 783.293 <- this is slow

Program to generate this and try out yourself:

#include <iostream> #include <vector> #include <chrono> #include <random>  template< typename T_rng, typename T_dist> double time_rngs(T_rng& rng, T_dist& dist, int n){     std::vector< typename T_dist::result_type > vec(n, 0);     auto t1 = std::chrono::high_resolution_clock::now();     for (int i = 0; i < n; ++i)         vec[i] = dist(rng);     auto t2 = std::chrono::high_resolution_clock::now();     auto runtime = std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()/1000.0;     auto sum = vec[0]; //access to avoid compiler skipping     return runtime; }  int main(){     const int n = 10000000;     unsigned seed = std::chrono::system_clock::now().time_since_epoch().count();     std::default_random_engine rng_default(seed);     std::mt19937 rng_mt (seed);     std::mt19937_64 rng_mt_64 (seed);     std::uniform_int_distribution<int> dist_int(0,1000);     std::uniform_real_distribution<float> dist_float(0.0, 1.0);      // print max values     std::cout << "rng_default_random.max(): " << rng_default.max() << std::endl;     std::cout << "rng_mt.max(): " << rng_mt.max() << std::endl;     std::cout << "rng_mt_64.max(): " << rng_mt_64.max() << std::endl << std::endl;      std::cout << "runtime_int_default: " << time_rngs(rng_default, dist_int, n) << std::endl;     std::cout << "runtime_int_mt: " << time_rngs(rng_mt_64, dist_int, n) << std::endl;     std::cout << "runtime_int_mt_64: " << time_rngs(rng_mt_64, dist_int, n) << std::endl;     std::cout << "runtime_float_default: " << time_rngs(rng_default, dist_float, n) << std::endl;     std::cout << "runtime_float_mt: " << time_rngs(rng_mt, dist_float, n) << std::endl;     std::cout << "runtime_float_mt_64: " << time_rngs(rng_mt_64, dist_float, n) << std::endl; }

compile via clang++ -O3 -std=c++11 random.cpp or g++ respectively. Any ideas?

edit: Finally, Matthieu M. had a great idea: The culprit is inlining, or rather a lack thereof. Increasing the clang inlining limit eliminated the performance penalty. That actually solved a number of performance oddities I encountered. Thanks, I learned something new.

Baum mit Augen · Accepted Answer

As already stated in the comments, the problem is caused by the fact that gcc inlines more aggressive than clang. If we make clang inline very aggressively, the effect disappears:

Compiling your code with g++ -O3 yields

runtime_int_default: 3000.32 runtime_int_mt: 3112.11 runtime_int_mt_64: 3069.48 runtime_float_default: 859.14 runtime_float_mt: 1027.05 runtime_float_mt_64: 1777.48

while clang++ -O3 -mllvm -inline-threshold=10000 yields

runtime_int_default: 3623.89 runtime_int_mt: 751.484 runtime_int_mt_64: 751.132 runtime_float_default: 1072.53 runtime_float_mt: 968.967 runtime_float_mt_64: 1781.34

Apparently, clang now out-inlines gcc in the int_mt cases, but all of the other runtimes are now in the same order of magnitude. I used gcc 4.8.3 and clang 3.4 on Fedora 20 64 bit.

Clang performance drop for specific C++ random number generation

Tags:

Basti

1 Answers

Baum mit Augen

Recent Activity

Donate For Us

Clang performance drop for specific C++ random number generation

Tags:

Basti

1 Answers

Baum mit Augen

Related questions

Recent Activity

Donate For Us