I have a simple program that runs a Monte Carlo algorithm. One iteration of the algorithm has no side effects, so I should be able to run it with multiple threads. This is the relevant part of my whole program, which is written in C++11:
void task(unsigned int max_iter, std::vector<unsigned int> *results, std::vector<unsigned int>::iterator iterator) {
    for (unsigned int n = 0; n < max_iter; ++n) {
        nume::Album album(535);
        unsigned int steps = album.fill_up();
        *iterator = steps;
        ++iterator;
    }
}
void aufgabe2() {
    std::cout << "\nAufgabe 2\n";

    unsigned int max_iter = 10000;
    unsigned int thread_count = 4;

    std::vector<std::thread> threads(thread_count);
    std::vector<unsigned int> results(max_iter);

    std::cout << "Computing with " << thread_count << " threads" << std::endl;

    int i = 0;
    for (std::thread &thread: threads) {
        std::vector<unsigned int>::iterator start = results.begin() + max_iter/thread_count * i;
        thread = std::thread(task, max_iter/thread_count, &results, start);
        i++;
    }

    for (std::thread &thread: threads) {
        thread.join();
    }

    std::ofstream out;
    out.open("out-2a.csv");
    for (unsigned int count: results) {
        out << count << std::endl;
    }
    out.close();

    std::cout << "Siehe Plot" << std::endl;
}
The puzzling thing is that it gets slower the more threads I add. With 4 threads, I get this:
real 0m5.691s
user 0m3.784s
sys 0m10.844s
Whereas with a single thread:
real 0m1.145s
user 0m0.816s
sys 0m0.320s
I realize that moving data between the CPU cores might add overhead, but the vector is allocated once at startup and is never resized or reallocated in between. Is there any particular reason for this to be slower on multiple cores?
My system is an i5-2550M, which has 2 physical cores plus Hyper-Threading (4 hardware threads), and I use g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3.
I noticed that without threads (1) the program generates mostly user load, whereas with threads (2) there is more kernel than user load:
10K Runs:
http://wstaw.org/m/2013/05/08/stats3.png
100K Runs:
http://wstaw.org/m/2013/05/08/Auswahl_001.png
Current main.cpp
With 100K runs, I get the following:
No threads at all:
real 0m28.705s
user 0m28.468s
sys 0m0.112s
A thread for each part of the program: those parts do not even touch the same memory, so contention for the same container should be ruled out as well. Yet it takes far longer:
real 2m50.609s
user 2m45.664s
sys 4m35.772s
So although the three main parts together occupy 300 % of my CPU, the run takes six times as long.
With 1M runs, it took real 4m45s. When I ran 1M runs previously, it took at least real 20m, if not even real 30m.
I evaluated your current main.cpp on GitHub. In addition to the comments provided above:
The vector object results is shared by all the threads you create, so even though your problem is embarrassingly parallel, the shared object leads to contention, not to mention cache misses (I am not expert enough to explain caches on modern architectures in detail). You should probably give each of your n threads its own result vector and merge them at the end. That would speed it up, I guess.
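To make that suggestion concrete, here is a minimal, self-contained sketch of the per-thread-vector idea. Since nume::Album is not shown in the question, the per-iteration work is replaced by a placeholder fake_fill_up() driven by a per-thread random number generator; only the structure matters: each thread fills a vector it owns exclusively, and the main thread concatenates the pieces after join().

#include <iostream>
#include <random>
#include <thread>
#include <vector>

// Placeholder for album.fill_up(); the real computation is not shown in the question.
unsigned int fake_fill_up(std::mt19937 &rng) {
    std::uniform_int_distribution<unsigned int> dist(1, 1000);
    unsigned int steps = 0;
    for (int i = 0; i < 100; ++i)
        steps += dist(rng);
    return steps;
}

// Each thread writes only to a vector it owns; nothing is shared while the threads run.
void task(unsigned int iterations, std::vector<unsigned int> *local_results) {
    std::mt19937 rng(std::random_device{}());  // per-thread RNG, no shared global state
    local_results->reserve(iterations);
    for (unsigned int n = 0; n < iterations; ++n)
        local_results->push_back(fake_fill_up(rng));
}

int main() {
    const unsigned int max_iter = 10000;
    const unsigned int thread_count = 4;

    std::vector<std::thread> threads;
    std::vector<std::vector<unsigned int>> partial(thread_count);  // one result vector per thread

    for (unsigned int i = 0; i < thread_count; ++i)
        threads.emplace_back(task, max_iter / thread_count, &partial[i]);

    for (std::thread &t : threads)
        t.join();

    // Merge the partial results only after all threads have finished.
    std::vector<unsigned int> results;
    results.reserve(max_iter);
    for (const std::vector<unsigned int> &p : partial)
        results.insert(results.end(), p.begin(), p.end());

    std::cout << "Collected " << results.size() << " results" << std::endl;
}

Compiled with g++ -std=c++11 -pthread -O2, a structure like this should scale with the number of physical cores, because no thread ever touches memory that another thread writes while the work is running.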
Another tip: use std::async instead of raw std::thread whenever possible. It handles thread allocation and other low-level messes for you; I read this in Scott Meyers' material on effective C++11. However, by using std::thread you can set a thread's affinity to a particular core, so if your processor supports 8 hardware threads you can create 8 threads and pin each one to its own core, at least on Linux.
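As a rough sketch of the std::async variant (again with a dummy do_work() standing in for the real per-chunk computation, which is not shown in the question), each task can simply return its chunk through a future, so no shared output container is needed at all:

#include <future>
#include <iostream>
#include <vector>

// Dummy stand-in for the real per-chunk computation, which is not shown in the question.
std::vector<unsigned int> do_work(unsigned int iterations) {
    std::vector<unsigned int> local(iterations);
    for (unsigned int n = 0; n < iterations; ++n)
        local[n] = n % 535;  // arbitrary dummy value instead of album.fill_up()
    return local;
}

int main() {
    const unsigned int max_iter = 100000;
    const unsigned int task_count = 4;

    // std::launch::async forces each task onto its own thread.
    std::vector<std::future<std::vector<unsigned int>>> futures;
    for (unsigned int i = 0; i < task_count; ++i)
        futures.push_back(std::async(std::launch::async, do_work, max_iter / task_count));

    // get() blocks until the corresponding task is done and hands back its chunk.
    std::vector<unsigned int> results;
    results.reserve(max_iter);
    for (std::future<std::vector<unsigned int>> &f : futures) {
        std::vector<unsigned int> chunk = f.get();
        results.insert(results.end(), chunk.begin(), chunk.end());
    }

    std::cout << "Collected " << results.size() << " results" << std::endl;
}

Note that std::thread itself has no portable affinity API; on Linux you would typically go through std::thread::native_handle() and pthread_setaffinity_np() if you really want to pin threads to specific cores.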