Here is a piece of C++11 code for a multi-threaded vector sum.
#include <thread>
#include <vector>
#include <numeric>   // for std::accumulate

template<typename ITER>
void sum_partial(ITER a, ITER b, double & result) {
  result = std::accumulate(a, b, 0.0);
}

template<typename ITER>
double sum(ITER begin, ITER end, unsigned int nb_threads) {
  size_t len = std::distance(begin, end);
  size_t size = len / nb_threads;
  std::vector<std::thread> thr(nb_threads - 1);
  std::vector<double> r(nb_threads);
  size_t be = 0;
  for(size_t i = 0; i < nb_threads - 1; i++) {
    size_t en = be + size;
    thr[i] = std::thread(sum_partial<ITER>, begin + be, begin + en, std::ref(r[i]));
    be = en;
  }
  sum_partial(begin + be, begin + len, r[nb_threads - 1]);
  for(size_t i = 0; i < nb_threads - 1; i++)
    thr[i].join();
  return std::accumulate(r.begin(), r.end(), 0.0);
}
The typical use will be sum(x.begin(), x.end(), n), with x a vector of doubles.
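As an aside, the same partition scheme can be written without a shared result array at all, using std::async (C++11). This is my sketch, not part of the original code; sum_async is a hypothetical name, and each task returns its partial sum through a future instead of writing into a shared vector:

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical alternative to sum(): each worker returns its partial sum
// through a std::future, so no shared result slots are ever written.
template<typename ITER>
double sum_async(ITER begin, ITER end, unsigned int nb_threads) {
    size_t len = std::distance(begin, end);
    size_t chunk = len / nb_threads;
    std::vector<std::future<double>> parts;
    size_t be = 0;
    for (unsigned int i = 0; i + 1 < nb_threads; i++) {
        parts.push_back(std::async(std::launch::async,
            [=] { return std::accumulate(begin + be, begin + be + chunk, 0.0); }));
        be += chunk;
    }
    // The calling thread handles the remainder, as in the original sum().
    double total = std::accumulate(begin + be, end, 0.0);
    for (auto& f : parts)
        total += f.get();
    return total;
}
```

If this version shows the same scaling curve, shared writes are not the issue.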
Here is a graph displaying the computation time as a function of the number of threads (average time for summing 10⁷ values, on an 8-core computer with nothing else running -- I tried on a 32-core computer; the behaviour is very similar).
Why is the scalability so poor? Can it be improved?
My (very limited) understanding is that for good scalability, threads should avoid writing to the same cache line. Here each thread writes to r only once, at the very end of its computation, so I wouldn't expect that to be the limiting factor. Is it a memory bandwidth problem?
accumulate has low utilization of the CPU's arithmetic units, but cache and memory throughput will most likely be the bottleneck, especially for 10^7 doubles: 10 million doubles = 80 MB of data, which is far more than your CPU's cache size.
To mitigate the cache and memory throughput bottleneck, you might want to enable prefetching with -fprefetch-loop-arrays, or even write some assembly by hand.