I'm trying around on the new C++11 threads, but my simple test has abysmal multicore performance. As a simple example, this program adds up some squared random numbers.
#include <iostream> #include <thread> #include <vector> #include <cstdlib> #include <chrono> #include <cmath> double add_single(int N) { double sum=0; for (int i = 0; i < N; ++i){ sum+= sqrt(1.0*rand()/RAND_MAX); } return sum/N; } void add_multi(int N, double& result) { double sum=0; for (int i = 0; i < N; ++i){ sum+= sqrt(1.0*rand()/RAND_MAX); } result = sum/N; } int main() { srand (time(NULL)); int N = 1000000; // single-threaded auto t1 = std::chrono::high_resolution_clock::now(); double result1 = add_single(N); auto t2 = std::chrono::high_resolution_clock::now(); auto time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count(); std::cout << "time single: " << time_elapsed << std::endl; // multi-threaded std::vector<std::thread> th; int nr_threads = 3; double partual_results[] = {0,0,0}; t1 = std::chrono::high_resolution_clock::now(); for (int i = 0; i < nr_threads; ++i) th.push_back(std::thread(add_multi, N/nr_threads, std::ref(partual_results[i]) )); for(auto &a : th) a.join(); double result_multicore = 0; for(double result:partual_results) result_multicore += result; result_multicore /= nr_threads; t2 = std::chrono::high_resolution_clock::now(); time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count(); std::cout << "time multi: " << time_elapsed << std::endl; return 0; }
Compiled with 'g++ -std=c++11 -pthread test.cpp' on Linux and a 3core machine, a typical result is
time single: 33 time multi: 565
So the multi threaded version is more than an order of magnitude slower. I've used random numbers and a sqrt to make the example less trivial and prone to compiler optimizations, so I'm out of ideas.
edit:
Wow I found the problem. It was indeed rand(). I replaced it with an C++11 equivalent and now the runtime scales perfectly. Thanks everyone!
In fact, multithreading can be slower due to the overhead of creating the threads and context switching between them. The multithreaded program performed worse due to the overhead of creating 100 threads and forcing them all to wait with the mutex .
The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call.
It is important to distinguish between these two types of process memory because each thread will have its own stack, but all the threads in a process will share the heap. Threads are sometimes called lightweight processes because they have their own stack but can access shared data.
On my system the behavior is same, but as Maxim mentioned, rand is not thread safe. When I change rand to rand_r, then the multi threaded code is faster as expected.
void add_multi(int N, double& result) { double sum=0; unsigned int seed = time(NULL); for (int i = 0; i < N; ++i){ sum+= sqrt(1.0*rand_r(&seed)/RAND_MAX); } result = sum/N; }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With