I've written a program for search of the maximum in arrays using c++0x threads (for learning purposes). For implementation I used standard thread and future classes. However, parallelized function constantly showes same or worse run time than non-parallelized.
Code is below. I tried to store data in one-dimensional array, multi-dimensional array and ended up with several arrays. However, no option has given good results. I tried to compile and run my code from Eclipse and command line, still with no success. I also tried similar test without array usage. Parallelization gave only 20% speed up there. From my point of view, I run very simple parallel program, without locks and almost no resource sharing (each thread operates on his own array). What is bottleneck?
My machine has Intel Core i7 processor 2.2 GHz with 8 GB of RAM, running Ubuntu 12.04.
const int n = 100000000;
int a[n], b[n], c[n], d[n];
int find_max_usual() {
int res = 0;
for (int i = 0; i < n; ++i) {
res = max(res, a[i]);
res = max(res, b[i]);
res = max(res, c[i]);
res = max(res, d[i]);
}
return res;
}
int find_max(int *a) {
int res = 0;
for (int i = 0; i < n; ++i)
res = max(res, a[i]);
return res;
}
int find_max_parallel() {
future<int> res_a = async(launch::async, find_max, a);
future<int> res_b = async(launch::async, find_max, b);
future<int> res_c = async(launch::async, find_max, c);
future<int> res_d = async(launch::async, find_max, d);
int res = max(max(res_a.get(), res_b.get()), max(res_c.get(), res_d.get()));
return res;
}
double get_time() {
timeval tim;
gettimeofday(&tim, NULL);
double t = tim.tv_sec + (tim.tv_usec / 1000000.0);
return t;
}
int main() {
for (int i = 0; i < n; ++i) {
a[i] = rand();
b[i] = rand();
c[i] = rand();
d[i] = rand();
}
double start = get_time();
int x = find_max_usual();
cerr << x << " " << get_time() - start << endl;
start = get_time();
x = find_max_parallel();
cerr << x << " " << get_time() - start << endl;
return 0;
}
Timing showed that almost all the time in find_max_parralel is consumed by
int res = max(max(res_a.get(), res_b.get()), max(res_c.get(), res_d.get()));
Compilation command line
g++ -O3 -std=c++0x -pthread x.cpp
Update. Problem is solved. I got desired results with same test. 4 threads give about 3.3 speed up, 3 threads give about 2.5 speed up, 2 threads behave almost ideally with 1.9 speed up. I've just rebooted system with some new updates. I haven't seen any significant difference in cpu load and running porgrams.
Thanks to all for help.
You have to explicitly set std::launch::async
.
future<int> res_c = async(std::launch::async, find_max, c);
If you omit the flag std::launch::async | std::launch::deferred
is assumend which leaves it up to implementation to choose whether to start the task asynchronously or deferred.
Current versions of gcc use std::launch::deferred
, MSVC has an runtime scheduler which decides on runtime how the task should be run.
Also note that if you want to try:
std::async(find_max, c);
this will also block because the destructor of std::future
waits for the task to finish.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With