I'm playing around with threads in C++, in particular using them to parallelize a map operation.
Here's the code:
#include <thread>
#include <iostream>
#include <cstdlib>
#include <vector>
#include <math.h>
#include <stdio.h>
#include <time.h>   // clock(), clock_gettime(), struct timespec

double multByTwo(double x){
    return x*2;
}

double doJunk(double x){
    return cos(pow(sin(x*2),3));
}

// Apply ptr to every element of data in place.
template <typename T>
void map(T* data, int n, T (*ptr)(T)){
    for (int i=0; i<n; i++)
        data[i] = (*ptr)(data[i]);
}

// Split the array into NUMCORES chunks and run map on each chunk in its own thread.
template <typename T>
void parallelMap(T* data, int n, T (*ptr)(T)){
    int NUMCORES = 3;
    std::vector<std::thread> threads;
    for (int i=0; i<NUMCORES; i++)
        threads.push_back(std::thread(&map<T>, data + i*n/NUMCORES, n/NUMCORES, ptr));
    for (std::thread& t : threads)
        t.join();
}

int main()
{
    int n = 1000000000;
    double* nums = new double[n];
    for (int i=0; i<n; i++)
        nums[i] = i;

    std::cout<<"go"<<std::endl;

    clock_t c1 = clock();
    struct timespec start, finish;
    double elapsed;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // also try with &doJunk
    //parallelMap(nums, n, &multByTwo);
    map(nums, n, &doJunk);
    std::cout << nums[342] << std::endl;

    clock_gettime(CLOCK_MONOTONIC, &finish);
    printf("CPU elapsed time is %f seconds\n", double(clock()-c1)/CLOCKS_PER_SEC);
    elapsed = (finish.tv_sec - start.tv_sec);
    elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
    printf("Actual elapsed time is %f seconds\n", elapsed);
}
With multByTwo the parallel version is actually slightly slower (1.01 seconds versus 0.95 real time), and with doJunk it's faster (51 versus 136 real time). This implies to me that the parallelization only pays off when the per-element work is expensive enough. Why would the parallel version be slower for the cheap operation?
Just a guess: what you're likely seeing is that the multByTwo code is so fast that you're achieving memory saturation. The code will never run any faster no matter how much processor power you throw at it, because it's already going as fast as it can get the bits to and from RAM.
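You can sanity-check this with a rough estimate: 10^9 doubles read and written back is about 16 GB of memory traffic per pass, and at a typical bandwidth of 10-20 GB/s that alone accounts for roughly a second, which is in the range you measured. Here is a minimal sketch (my own, not part of your program) that streams through the same amount of data with a trivial read-modify-write; if it takes about as long as map with multByTwo, the loop is memory-bound and extra threads cannot help much:

#include <stdio.h>
#include <time.h>

int main()
{
    // Same amount of data as in the question: 10^9 doubles = 8 GB.
    const long n = 1000000000L;
    double* nums = new double[n];
    for (long i = 0; i < n; i++)
        nums[i] = i;

    struct timespec start, finish;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < n; i++)
        nums[i] = nums[i] * 2.0;   // read + write, almost no arithmetic
    clock_gettime(CLOCK_MONOTONIC, &finish);

    double seconds = (finish.tv_sec - start.tv_sec)
                   + (finish.tv_nsec - start.tv_nsec) / 1e9;
    // ~16 bytes of memory traffic per element (8 read + 8 written)
    printf("%.2f s, ~%.1f GB/s effective bandwidth\n",
           seconds, 16.0 * n / 1e9 / seconds);
    delete[] nums;
    return 0;
}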
You did not specify the hardware you tested your program on, nor the compiler version and operating system. I tried your code on our four-socket Intel Xeon systems under 64-bit Scientific Linux with g++ 4.7 compiled from source.
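(For reference, I built it with something along the lines of g++ -std=c++0x -O2 -pthread program.cpp -o program; the file name and optimization level are my choice, but -pthread is needed, otherwise std::thread fails at run time with g++.)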
First on an older Xeon X7350 system I got the following timings:
multByTwo with map
CPU elapsed time is 6.690000 seconds
Actual elapsed time is 6.691940 seconds

multByTwo with parallelMap on 3 cores
CPU elapsed time is 7.330000 seconds
Actual elapsed time is 2.480294 seconds
The parallel speedup is 2.7x.
doJunk with map
CPU elapsed time is 209.250000 seconds
Actual elapsed time is 209.289025 seconds

doJunk with parallelMap on 3 cores
CPU elapsed time is 220.770000 seconds
Actual elapsed time is 73.900960 seconds
The parallel speedup is 2.83x.
Note that the X7350 is from the quite old pre-Nehalem "Tigerton" family with a front-side bus and a shared memory controller located in the north bridge. This is a pure SMP system with no NUMA effects.
Then I ran your code on a four-socket Intel X7550 system. These are Nehalem ("Beckton") Xeons with the memory controller integrated into the CPU, hence a 4-node NUMA system. Threads running on one socket and accessing memory located on another socket will run somewhat slower. The same is true for a serial process that gets migrated to another socket by an unfortunate scheduler decision. Binding in such a system is very important, as you can see from the timings:
multByTwo with map
CPU elapsed time is 4.270000 seconds
Actual elapsed time is 4.264875 seconds

multByTwo with map, bound to NUMA node 0
CPU elapsed time is 4.160000 seconds
Actual elapsed time is 4.160180 seconds

multByTwo with map, bound to NUMA node 0 and CPU socket 1
CPU elapsed time is 5.910000 seconds
Actual elapsed time is 5.912319 seconds

multByTwo with parallelMap on 3 cores
CPU elapsed time is 7.530000 seconds
Actual elapsed time is 3.696616 seconds
Parallel speedup is only 1.13x (relative to the fastest node-bound serial execution). Now with binding:
multByTwo with parallelMap on 3 cores, bound to NUMA node 0
CPU elapsed time is 4.630000 seconds
Actual elapsed time is 1.548102 seconds
Parallel speedup is 2.69x - as much as for the Tigerton CPUs.
multByTwo with parallelMap on 3 cores, bound to NUMA node 0 and CPU socket 1
CPU elapsed time is 5.190000 seconds
Actual elapsed time is 1.760623 seconds
Parallel speedup is 2.36x - 88% of the previous case.
(I was too impatient to wait for the doJunk code to finish on the relatively slower Nehalems, but I would expect somewhat better performance, as in the Tigerton case.)
There is one caveat with NUMA binding though. If you force binding to NUMA node 0 with, e.g., numactl --cpubind=0 --membind=0 ./program, memory allocation will be limited to that node only; on your particular system the memory attached to CPU 0 might not be enough, and a run-time failure will most likely occur.
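If you prefer to bind from inside the program instead of via numactl, here is a minimal sketch (my addition, Linux-specific) of a variant of your parallelMap that pins each worker thread to a fixed core with pthread_setaffinity_np. Which logical CPU numbers belong to which socket or NUMA node depends on the machine, so the cores 0..2 used below are only an assumption; check the topology with numactl --hardware first:

// Build with g++ -std=c++0x -pthread on Linux; g++ defines _GNU_SOURCE,
// which exposes the CPU_ZERO/CPU_SET macros from <sched.h>.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Same map as in the question.
template <typename T>
void map(T* data, int n, T (*ptr)(T)){
    for (int i = 0; i < n; i++)
        data[i] = (*ptr)(data[i]);
}

template <typename T>
void parallelMapBound(T* data, int n, T (*ptr)(T)){
    const int NUMCORES = 3;
    std::vector<std::thread> threads;
    for (int i = 0; i < NUMCORES; i++){
        threads.push_back(std::thread(&map<T>, data + (long)i*n/NUMCORES,
                                      n/NUMCORES, ptr));
        // Pin the thread we just created to logical CPU i
        // (assumes CPUs 0..NUMCORES-1 sit on the same NUMA node).
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);
        pthread_setaffinity_np(threads.back().native_handle(),
                               sizeof(cpu_set_t), &set);
    }
    for (std::thread& t : threads)
        t.join();
}

Note that this only controls where the threads run; where the memory ends up living is still decided by where it was first touched (or by numactl --membind).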
As you can see, there are factors other than the overhead of creating threads that can significantly influence your code's execution time. Also, on very fast systems the overhead can be too high compared to the computational work done by each thread. That's why, when asking questions concerning parallel performance, one should always include as many details as possible about the hardware and the environment used to measure the performance.
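To put a rough number on that thread start-up cost on your own machine, you can time the creation and joining of the same three threads with an empty body; this little sketch (mine, not from the original code) shows directly how the overhead compares with the work done per thread:

#include <stdio.h>
#include <thread>
#include <time.h>
#include <vector>

int main()
{
    struct timespec start, finish;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // Spawn and join the same number of threads as parallelMap, but with no work.
    std::vector<std::thread> threads;
    for (int i = 0; i < 3; i++)
        threads.push_back(std::thread([]{}));
    for (std::thread& t : threads)
        t.join();

    clock_gettime(CLOCK_MONOTONIC, &finish);
    double elapsed = (finish.tv_sec - start.tv_sec)
                   + (finish.tv_nsec - start.tv_nsec) / 1e9;
    printf("Creating and joining 3 empty threads took %f seconds\n", elapsed);
    return 0;
}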