Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is this C++11 code containing rand() slower with multiple threads than with one?

I'm trying around on the new C++11 threads, but my simple test has abysmal multicore performance. As a simple example, this program adds up some squared random numbers.

#include <iostream> #include <thread> #include <vector> #include <cstdlib> #include <chrono> #include <cmath>  double add_single(int N) {     double sum=0;     for (int i = 0; i < N; ++i){         sum+= sqrt(1.0*rand()/RAND_MAX);     }     return sum/N; }  void add_multi(int N, double& result) {     double sum=0;     for (int i = 0; i < N; ++i){         sum+= sqrt(1.0*rand()/RAND_MAX);     }     result = sum/N; }  int main() {     srand (time(NULL));     int N = 1000000;      // single-threaded     auto t1 = std::chrono::high_resolution_clock::now();     double result1 = add_single(N);     auto t2 = std::chrono::high_resolution_clock::now();     auto time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();     std::cout << "time single: " << time_elapsed << std::endl;      // multi-threaded     std::vector<std::thread> th;     int nr_threads = 3;     double partual_results[] = {0,0,0};     t1 = std::chrono::high_resolution_clock::now();     for (int i = 0; i < nr_threads; ++i)          th.push_back(std::thread(add_multi, N/nr_threads, std::ref(partual_results[i]) ));     for(auto &a : th)         a.join();     double result_multicore = 0;     for(double result:partual_results)         result_multicore += result;     result_multicore /= nr_threads;     t2 = std::chrono::high_resolution_clock::now();     time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();     std::cout << "time multi: " << time_elapsed << std::endl;      return 0; } 

Compiled with 'g++ -std=c++11 -pthread test.cpp' on Linux and a 3core machine, a typical result is

time single: 33 time multi: 565 

So the multi threaded version is more than an order of magnitude slower. I've used random numbers and a sqrt to make the example less trivial and prone to compiler optimizations, so I'm out of ideas.

edit:

  1. This problem scales for larger N, so the problem is not the short runtime
  2. The time for creating the threads is not the problem. Excluding it does not change the result significantly

Wow I found the problem. It was indeed rand(). I replaced it with an C++11 equivalent and now the runtime scales perfectly. Thanks everyone!

like image 763
Basti Avatar asked May 23 '13 14:05

Basti


People also ask

Why is my multithreading slower?

In fact, multithreading can be slower due to the overhead of creating the threads and context switching between them. The multithreaded program performed worse due to the overhead of creating 100 threads and forcing them all to wait with the mutex .

Is Rand function thread safe?

The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call.

Do threads share heap?

It is important to distinguish between these two types of process memory because each thread will have its own stack, but all the threads in a process will share the heap. Threads are sometimes called lightweight processes because they have their own stack but can access shared data.


1 Answers

On my system the behavior is same, but as Maxim mentioned, rand is not thread safe. When I change rand to rand_r, then the multi threaded code is faster as expected.

void add_multi(int N, double& result) { double sum=0; unsigned int seed = time(NULL); for (int i = 0; i < N; ++i){     sum+= sqrt(1.0*rand_r(&seed)/RAND_MAX); } result = sum/N; } 
like image 190
Croniak Avatar answered Oct 17 '22 20:10

Croniak