Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dividing loop iterations among threads

I recently wrote a small number-crunching program that basically loops over an N-dimensional grid and performs some calculation at each point.

for (int i1 = 0; i1 < N; i1++)
  for (int i2 = 0; i2 < N; i2++)
    for (int i3 = 0; i3 < N; i3++)
      for (int i4 = 0; i4 < N; i4++)
        histogram[bin_index(i1, i2, i3, i4)] += 1; // see bottom of question

It worked fine, yadda yadda yadda, lovely graphs resulted ;-) But then I thought, I have 2 cores on my computer, why not make this program multithreaded so I could run it twice as fast?

Now, my loops run a total of, let's say, around a billion calculations, and I need some way to split them up among threads. I figure I should group the calculations into "tasks" - say each iteration of the outermost loop is a task - and hand out the tasks to threads. I've considered

  • just giving thread #n all iterations of the outermost loop where i1 % nthreads == n - essentially predetermining which tasks go to which threads
  • trying to set up some mutex-protected variable which holds the parameter(s) (i1 in this case) of the next task that needs executing - assigning tasks to threads dynamically

What reasons are there to choose one approach over the other? Or another approach I haven't thought about? Does it even matter?

By the way, I wrote this particular program in C, but I imagine I'll be doing the same kind of thing again in other languages as well so answers need not be C-specific. (If anyone knows a C library for Linux that does this sort of thing, though, I'd love to know about it)

EDIT: in this case bin_index is a deterministic function which doesn't change anything except its own local variables. Something like this:

int bin_index(int i1, int i2, int i3, int i4) {
    // w, d, h are constant floats
    float x1 = i1 * w / N,  x2 = i2 * w / N, y1 = i3 * d / N, y2 = i4 * d / N;
    float l = sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) + h * h);
    float th = acos(h / l);
    // th_max is a constant float (previously computed as a function of w, d, h)
    return (int)(th / th_max);
}

(although I appreciate all the comments, even those which don't apply to a deterministic bin_index)

like image 969
David Z Avatar asked Feb 19 '09 09:02

David Z


3 Answers

The first approach is simple. It is also sufficient if you expect that the load will be balanced evenly over the threads. In some cases, especially if the complexity of bin_index is very dependant on the parameter values, one of the threads could end up with a much heavier task than the rest. Remember: the task is finished when the last threads finishes.

The second approach is a bit more complicated, but balances the load more evenly if the tasks are finegrained enough (the number of tasks is much larger than the number of threads).

Note that you may have issues putting the calculations in separate threads. Make sure that bin_index works correctly when multiple threads execute it simultaneously. Beware of the use of global or static variables for intermediate results.

Also, "histogram[bin_index(i1, i2, i3, i4)] += 1" could be interrupted by another thread, causing the result to be incorrect (if the assignment fetches the value, increments it and stores the resulting value in the array). You could introduce a local histogram for each thread and combine the results to a single histogram when all threads have finished. You could also make sure that only one thread is modifying the histogram at the same time, but that may cause the threads to block each other most of the time.

like image 166
Renze de Waal Avatar answered Sep 29 '22 13:09

Renze de Waal


The first approach is enough. No need for complication here. If you start playing with mutexes you risk making hard to detect errors.

Don't start complicating unless you really see that you need this. Syncronization issues (especially in case of many threads instead of many processes) can be really painful.

like image 40
sharptooth Avatar answered Sep 29 '22 14:09

sharptooth


As I understand it, OpenMP was made just for what you are trying to do, although I have to admit I have not used it yet myself. Basically it seems to boil down to just including a header and adding a pragma clause.

You could probably also use Intel's Thread Building Blocks Library.

like image 28
Adrian Grigore Avatar answered Sep 29 '22 13:09

Adrian Grigore