Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Large number of threads in C++ and Efficiency

I've currently written a program in C++ that sometimes uses over 300 threads. In my program, I have an array of structs and the length of the array equals the number of threads. Let's assume that I have 400 structs and therefore, 400 threads.

In a single iteration of a for loop, I apply a function to each of the 400 structs, and this function is executed in a thread. Therefore, I have 400 threads running concurrently. (I am using the boost thread library).

I've tried to give a breakdown of what my code looks like (it is not the actual code):

struct my_struct{
  // Structure's members
};

std::vector<my_struct> my_vec;

void my_fun(my_struct* my_str){
// Operations on my_str
}

int main(){
  std::vector<boost::thread> thr(400);
  for (int k = 0; k < 300; k++){
    for (int i = 0; i < 400; i++){
      thr.at(i) = boost::thread(my_fun, &my_vec.at(i));
      }
    }

    for (int m = 0; m < M; m++){
      thr.at(m).join();
    }
  }
}

The function I am using is computationally intensive, and from the code above, I use 400 threads to do calculations and this is done 300 times. Is there any more efficient way of performing this task? I'm not sure if having so many threads active at a single time may affect performance. I've heard of the threadpool library, but I'm not sure whether it'll provide any benefit to me. Any help is appreciated.

Thank You Very Much.

like image 255
A-A Avatar asked Apr 13 '11 07:04

A-A


2 Answers

There is absolutely no benefit to spawning 400 CPU-bound threads unless you have 400+ processor cores in your target machine.

It would be impossible to tell you with any certainty how to better distribute your workload without knowing what sort of computations you're performing, and on what kind of data.

As a shot in the dark, judging from what you have posted, a first stab would be to use N threads (see below), and divide your 400 objects among them so that each thread is responsible for processing approximately 400/N objects. Each thread can loop 300 times, and on each iteration it can process each of its assigned objects.

N is an arbitrary number; in fact, I recommend trying different values and comparing the performance results. However, unless your threads are performing I/O or other operations that waste time blocking on non-computational operations, N should be no larger than the number of processor cores in your machine (try it and watch your performance drop quickly).

Edit: As per the ongoing discussion, it would be advisable to employ a queue of your objects from which each of your N threads can simply pop as they are ready for more work. The queue will of course need to be thread-safe. For optimal performance, a lock-free queue should be implemented. There's a good paper here. The implementation should be simplified by the fact that you are fully populating the queue once and therefore only need thread-safe reads.

like image 114
kqnr Avatar answered Oct 15 '22 16:10

kqnr


The only way in which it is beneficial to have more threads than there are actual execution engines (CPUs or cores or whatever is being used - I'll just call them CPUs here) is if some of the threads are actually waiting on resources other than those CPUs.

If the threads are CPU-bound, the ideal number is equivalent to the number of CPUs you have available to you. If many of the threads are waiting on file I/O or database accesses or network traffic or OS events (and so on), then a couple of hundred may be okay. But, in your case, that appears not to be so.

A thread pool is really a way to avoid continuously creating and destroying threads in situations where that could be relatively inefficient. For example, if it takes ten seconds to start a thread and each only does one second of work, then a thread pool would be ideal.

Given that you're likely to be reducing the number of threads to something substantially less than four hundred (say about two or four), which will in turn ramp up the work done by each, a thread pool may not be needed. But again, it depends on the amount of work that will be done by the threads compared with their startup cost.

To keep it simple, I'd start with the non pool version and only contemplate changing if there's a serious performance issue. Otherwise you may be giving yourself extra work for no real benefit.

You can still divide up your work into four hundred units but the best approach will be to simply queue them up and have each of your threads pull an item from the queue when it's ready to process one. That way, the work is automatically balanced among CPUs. If, for some bizarre reason, CPU number 1 runs twice as fast as the others, it will automatically get twice as much of the workload.

This is more important than you think, simply because it's a near certainty that the CPUs will be doing other stuff as well - they're unlikely to be totally dedicated to just doing this work.

like image 23
paxdiablo Avatar answered Oct 15 '22 17:10

paxdiablo