I'm trying around on the new C++11 threads, but my simple test has abysmal multicore performance. As a simple example, this program adds up some squared random numbers. <pre class="prettyprint"><code>#include <iostream> #include <thread> #include <vector> #include <cstdlib> #include <chrono> #include <cmath> double add_single(int N) { double sum=0; for (int i = 0; i < N; ++i){ sum+= sqrt(1.0*rand()/RAND_MAX); } return sum/N; } void add_multi(int N, double& result) { double sum=0; for (int i = 0; i < N; ++i){ sum+= sqrt(1.0*rand()/RAND_MAX); } result = sum/N; } int main() { srand (time(NULL)); int N = 1000000; // single-threaded auto t1 = std::chrono::high_resolution_clock::now(); double result1 = add_single(N); auto t2 = std::chrono::high_resolution_clock::now(); auto time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count(); std::cout << "time single: " << time_elapsed << std::endl; // multi-threaded std::vector<std::thread> th; int nr_threads = 3; double partual_results[] = {0,0,0}; t1 = std::chrono::high_resolution_clock::now(); for (int i = 0; i < nr_threads; ++i) th.push_back(std::thread(add_multi, N/nr_threads, std::ref(partual_results[i]) )); for(auto &a : th) a.join(); double result_multicore = 0; for(double result:partual_results) result_multicore += result; result_multicore /= nr_threads; t2 = std::chrono::high_resolution_clock::now(); time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count(); std::cout << "time multi: " << time_elapsed << std::endl; return 0; } </code></pre> Compiled with 'g++ -std=c++11 -pthread test.cpp' on Linux and a 3core machine, a typical result is <pre class="prettyprint"><code>time single: 33 time multi: 565 </code></pre> So the multi threaded version is more than an order of magnitude slower. I've used random numbers and a sqrt to make the example less trivial and prone to compiler optimizations, so I'm out of ideas. edit: <ol> <li>This problem scales for larger N, so the problem is not the short runtime</li> <li>The time for creating the threads is not the problem. Excluding it does not change the result significantly</li> </ol> Wow I found the problem. It was indeed rand(). I replaced it with an C++11 equivalent and now the runtime scales perfectly. Thanks everyone!

On my system the behavior is same, but as Maxim mentioned, rand is not thread safe. When I change rand to rand_r, then the multi threaded code is faster as expected. <pre class="prettyprint"><code>void add_multi(int N, double& result) { double sum=0; unsigned int seed = time(NULL); for (int i = 0; i < N; ++i){ sum+= sqrt(1.0*rand_r(&seed)/RAND_MAX); } result = sum/N; } </code></pre>

Why is this C++11 code containing rand() slower with multiple threads than with one?

I'm trying around on the new C++11 threads, but my simple test has abysmal multicore performance. As a simple example, this program adds up some squared random numbers.

#include <iostream> #include <thread> #include <vector> #include <cstdlib> #include <chrono> #include <cmath>  double add_single(int N) {     double sum=0;     for (int i = 0; i < N; ++i){         sum+= sqrt(1.0*rand()/RAND_MAX);     }     return sum/N; }  void add_multi(int N, double& result) {     double sum=0;     for (int i = 0; i < N; ++i){         sum+= sqrt(1.0*rand()/RAND_MAX);     }     result = sum/N; }  int main() {     srand (time(NULL));     int N = 1000000;      // single-threaded     auto t1 = std::chrono::high_resolution_clock::now();     double result1 = add_single(N);     auto t2 = std::chrono::high_resolution_clock::now();     auto time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();     std::cout << "time single: " << time_elapsed << std::endl;      // multi-threaded     std::vector<std::thread> th;     int nr_threads = 3;     double partual_results[] = {0,0,0};     t1 = std::chrono::high_resolution_clock::now();     for (int i = 0; i < nr_threads; ++i)          th.push_back(std::thread(add_multi, N/nr_threads, std::ref(partual_results[i]) ));     for(auto &a : th)         a.join();     double result_multicore = 0;     for(double result:partual_results)         result_multicore += result;     result_multicore /= nr_threads;     t2 = std::chrono::high_resolution_clock::now();     time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();     std::cout << "time multi: " << time_elapsed << std::endl;      return 0; }

Compiled with 'g++ -std=c++11 -pthread test.cpp' on Linux and a 3core machine, a typical result is

time single: 33 time multi: 565

So the multi threaded version is more than an order of magnitude slower. I've used random numbers and a sqrt to make the example less trivial and prone to compiler optimizations, so I'm out of ideas.

edit:

This problem scales for larger N, so the problem is not the short runtime
The time for creating the threads is not the problem. Excluding it does not change the result significantly

Wow I found the problem. It was indeed rand(). I replaced it with an C++11 equivalent and now the runtime scales perfectly. Thanks everyone!

Why is my multithreading slower?

In fact, multithreading can be slower due to the overhead of creating the threads and context switching between them. The multithreaded program performed worse due to the overhead of creating 100 threads and forcing them all to wait with the mutex .

Is Rand function thread safe?

The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call.

Do threads share heap?

It is important to distinguish between these two types of process memory because each thread will have its own stack, but all the threads in a process will share the heap. Threads are sometimes called lightweight processes because they have their own stack but can access shared data.

On my system the behavior is same, but as Maxim mentioned, rand is not thread safe. When I change rand to rand_r, then the multi threaded code is faster as expected.

void add_multi(int N, double& result) { double sum=0; unsigned int seed = time(NULL); for (int i = 0; i < N; ++i){     sum+= sqrt(1.0*rand_r(&seed)/RAND_MAX); } result = sum/N; }

Why is this C++11 code containing rand() slower with multiple threads than with one?

Tags:

c++

performance

multithreading

c++11

Basti

People also ask

1 Answers

Croniak

Recent Activity

Donate For Us

Why is this C++11 code containing rand() slower with multiple threads than with one?

Tags:

c++

performance

multithreading

c++11

Basti

People also ask

1 Answers

Croniak

Related questions

Recent Activity

Donate For Us