I'm playing around with threads in C++, in particular using them to parallelize a map operation.
Here's the code:
#include <thread>
#include <iostream>
#include <cstdlib>
#include <vector>
#include <math.h>
#include <stdio.h>
#include <time.h>   // clock(), clock_gettime(), struct timespec

double multByTwo(double x){
    return x*2;
}

double doJunk(double x){
    return cos(pow(sin(x*2),3));
}

// Apply ptr to every element of data in place.
template <typename T>
void map(T* data, int n, T (*ptr)(T)){
    for (int i=0; i<n; i++)
        data[i] = (*ptr)(data[i]);
}

// Split the array into NUMCORES chunks and run map on each chunk in its own thread.
template <typename T>
void parallelMap(T* data, int n, T (*ptr)(T)){
    int NUMCORES = 3;
    std::vector<std::thread> threads;
    for (int i=0; i<NUMCORES; i++)
        threads.push_back(std::thread(&map<T>, data + i*n/NUMCORES, n/NUMCORES, ptr));
    for (std::thread& t : threads)
        t.join();
}

int main()
{
    int n = 1000000000;
    double* nums = new double[n];
    for (int i=0; i<n; i++)
        nums[i] = i;

    std::cout<<"go"<<std::endl;

    clock_t c1 = clock();
    struct timespec start, finish;
    double elapsed;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // also try with &doJunk
    //parallelMap(nums, n, &multByTwo);
    map(nums, n, &doJunk);
    std::cout << nums[342] << std::endl;

    clock_gettime(CLOCK_MONOTONIC, &finish);
    printf("CPU elapsed time is %f seconds\n", double(clock()-c1)/CLOCKS_PER_SEC);
    elapsed = (finish.tv_sec - start.tv_sec);
    elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
    printf("Actual elapsed time is %f seconds\n", elapsed);
}
With multByTwo the parallel version is actually slightly slower (1.01 seconds versus 0.95 real time), and with doJunk it's faster (51 versus 136 real time). This implies to me that the parallelization only pays off when the per-element work is expensive enough. Why would the parallel version be slower for the cheap operation?
Just a guess: what you're likely seeing is that the multByTwo code is so fast that you're achieving memory saturation. The code will never run any faster no matter how much processor power you throw at it, because it's already going as fast as it can get the bits to and from RAM.
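You can sanity-check this with a rough estimate: 10^9 doubles read and written back is about 16 GB of memory traffic per pass, and at a typical bandwidth of 10-20 GB/s that alone accounts for roughly a second, which is in the range you measured. Here is a minimal sketch (my own, not part of your program) that streams through the same amount of data with a trivial read-modify-write; if it takes about as long as map with multByTwo, the loop is memory-bound and extra threads cannot help much:

#include <stdio.h>
#include <time.h>

int main()
{
    // Same amount of data as in the question: 10^9 doubles = 8 GB.
    const long n = 1000000000L;
    double* nums = new double[n];
    for (long i = 0; i < n; i++)
        nums[i] = i;

    struct timespec start, finish;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < n; i++)
        nums[i] = nums[i] * 2.0;   // read + write, almost no arithmetic
    clock_gettime(CLOCK_MONOTONIC, &finish);

    double seconds = (finish.tv_sec - start.tv_sec)
                   + (finish.tv_nsec - start.tv_nsec) / 1e9;
    // ~16 bytes of memory traffic per element (8 read + 8 written)
    printf("%.2f s, ~%.1f GB/s effective bandwidth\n",
           seconds, 16.0 * n / 1e9 / seconds);
    delete[] nums;
    return 0;
}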
You did not specify the hardware you tested your program on, nor the compiler version and operating system. I tried your code on our four-socket Intel Xeon systems under 64-bit Scientific Linux with g++ 4.7 compiled from source.
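(For reference, I built it with something along the lines of g++ -std=c++0x -O2 -pthread program.cpp -o program; the file name and optimization level are my choice, but -pthread is needed, otherwise std::thread fails at run time with g++.)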
First on an older Xeon X7350 system I got the following timings:
multByTwo with map
CPU elapsed time is 6.690000 seconds
Actual elapsed time is 6.691940 seconds

multByTwo with parallelMap on 3 cores
CPU elapsed time is 7.330000 seconds
Actual elapsed time is 2.480294 seconds
The parallel speedup is 2.7x.
doJunk with map
CPU elapsed time is 209.250000 seconds
Actual elapsed time is 209.289025 seconds

doJunk with parallelMap on 3 cores
CPU elapsed time is 220.770000 seconds
Actual elapsed time is 73.900960 seconds
The parallel speedup is 2.83x.
Note that the X7350 is from the quite old pre-Nehalem "Tigerton" family with a front-side bus and a shared memory controller located in the north bridge. This is a pure SMP system with no NUMA effects.
Then I ran your code on a four-socket Intel X7550 system. These are Nehalem ("Beckton") Xeons with the memory controller integrated into the CPU, hence a 4-node NUMA system. Threads running on one socket and accessing memory located on another socket will run somewhat slower. The same is true for a serial process that gets migrated to another socket by an unfortunate scheduler decision. Binding in such a system is very important, as you can see from the timings:
multByTwo with map
CPU elapsed time is 4.270000 seconds
Actual elapsed time is 4.264875 seconds

multByTwo with map, bound to NUMA node 0
CPU elapsed time is 4.160000 seconds
Actual elapsed time is 4.160180 seconds

multByTwo with map, bound to NUMA node 0 and CPU socket 1
CPU elapsed time is 5.910000 seconds
Actual elapsed time is 5.912319 seconds

multByTwo with parallelMap on 3 cores
CPU elapsed time is 7.530000 seconds
Actual elapsed time is 3.696616 seconds
Parallel speedup is only 1.13x (relative to the fastest node-bound serial execution). Now with binding:
multByTwo with parallelMap on 3 cores, bound to NUMA node 0
CPU elapsed time is 4.630000 seconds
Actual elapsed time is 1.548102 seconds
Parallel speedup is 2.69x - as much as for the Tigerton CPUs.
multByTwo with parallelMap on 3 cores, bound to NUMA node 0 and CPU socket 1
CPU elapsed time is 5.190000 seconds
Actual elapsed time is 1.760623 seconds
Parallel speedup is 2.36x - 88% of the previous case.
(I was too impatient to wait for the doJunk code to finish on the relatively slower Nehalems, but I would expect somewhat better performance, as in the Tigerton case.)
There is one caveat with NUMA binding though. If you force binding to NUMA node 0 with, e.g., numactl --cpubind=0 --membind=0 ./program, memory allocation will be limited to that node only; on your particular system the memory attached to CPU 0 might not be enough, and a run-time failure will most likely occur.
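If you prefer to bind from inside the program instead of via numactl, here is a minimal sketch (my addition, Linux-specific) of a variant of your parallelMap that pins each worker thread to a fixed core with pthread_setaffinity_np. Which logical CPU numbers belong to which socket or NUMA node depends on the machine, so the cores 0..2 used below are only an assumption; check the topology with numactl --hardware first:

// Build with g++ -std=c++0x -pthread on Linux; g++ defines _GNU_SOURCE,
// which exposes the CPU_ZERO/CPU_SET macros from <sched.h>.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Same map as in the question.
template <typename T>
void map(T* data, int n, T (*ptr)(T)){
    for (int i = 0; i < n; i++)
        data[i] = (*ptr)(data[i]);
}

template <typename T>
void parallelMapBound(T* data, int n, T (*ptr)(T)){
    const int NUMCORES = 3;
    std::vector<std::thread> threads;
    for (int i = 0; i < NUMCORES; i++){
        threads.push_back(std::thread(&map<T>, data + (long)i*n/NUMCORES,
                                      n/NUMCORES, ptr));
        // Pin the thread we just created to logical CPU i
        // (assumes CPUs 0..NUMCORES-1 sit on the same NUMA node).
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);
        pthread_setaffinity_np(threads.back().native_handle(),
                               sizeof(cpu_set_t), &set);
    }
    for (std::thread& t : threads)
        t.join();
}

Note that this only controls where the threads run; where the memory ends up living is still decided by where it was first touched (or by numactl --membind).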
As you can see, there are factors other than the overhead of creating threads that can significantly influence your code's execution time. Also, on very fast systems the overhead can be too high compared to the computational work done by each thread. That's why, when asking questions concerning parallel performance, one should always include as many details as possible about the hardware and the environment used to measure the performance.
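To put a rough number on that thread start-up cost on your own machine, you can time the creation and joining of the same three threads with an empty body; this little sketch (mine, not from the original code) shows directly how the overhead compares with the work done per thread:

#include <stdio.h>
#include <thread>
#include <time.h>
#include <vector>

int main()
{
    struct timespec start, finish;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // Spawn and join the same number of threads as parallelMap, but with no work.
    std::vector<std::thread> threads;
    for (int i = 0; i < 3; i++)
        threads.push_back(std::thread([]{}));
    for (std::thread& t : threads)
        t.join();

    clock_gettime(CLOCK_MONOTONIC, &finish);
    double elapsed = (finish.tv_sec - start.tv_sec)
                   + (finish.tv_nsec - start.tv_nsec) / 1e9;
    printf("Creating and joining 3 empty threads took %f seconds\n", elapsed);
    return 0;
}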