 

How to parallelize small pure function?

I have a D2 program that, in its current form, is single-threaded and calls the same pure function about 10 to 100 times in an inner loop for each iteration of the program's outer loop. There is no data dependency among the calls, i.e. no call uses the result of any other call. Overall, this function is called millions of times and is the main bottleneck in my program. The parameters are unique almost every time, so caching wouldn't help.

At first glance, this seems like the perfect candidate for parallelization. The only problem is that the function only takes about 3 microseconds per call, well below the latency of creating a new thread, and not that far above the overhead of adding a job to a task pool (meaning, acquiring a mutex, allocating memory to hold information about the task, dealing with possible contention for the task pool's queue, etc.). Is there any good way to take advantage of parallelism that is this fine-grained?

dsimcha asked Feb 24 '09


4 Answers

What about creating multiple threads, each with its own queue to work from? Because the queues don't overlap, you shouldn't need any locks.

Georg Schölly answered Nov 14 '22


Don't start a thread up to run a single task and then shut it right down.

At the beginning of your program, create a thread for each core just sitting there waiting for data from a queue (pipe, or some mechanism of your own creation). If you can come up with a mechanism where all the threads wait on the same queue, even better, but then the queue's get method will have to be synchronized...

Whenever you have a block of a few hundred or a few thousand of your calls to be calculated, drop the entire block onto the next empty queue.

You'll actually end up with one or more threads feeding the queues, a bunch of threads processing data from the queues, and one or more reading and dealing with results.

You may need to put enough data in the "items" you are processing to be able to tell what to do with them after you are done. They should almost certainly be an object, and you may want them to contain state information.

You probably don't want more threads doing processing than you have cores.

Edit: Also look at the concurrency libraries, like ThreadPoolExecutor. It's easy to forget the concurrency library (like I just did); that's probably exactly what you were looking for (hence the emphasis).

Bill K answered Nov 14 '22


As suggested above, don't kick off a thread every time you enter this function; moreover, give each "job" a granularity larger than one call of the inner function, so that the job-creation overhead is well amortized. Your original routine looks something like:

void OuterFunction( Thingy inputData[], int N )
{
  for ( int i = 0 ; i < N ; ++i )
    InnerFunction( inputData[i] );
}

We'd solve your problem by (assuming a job queue system is present):

void JobFunc( Thingy inputData[], int start, int stop )
{
  for ( int i = start ; i < stop ; ++i )
    InnerFunction( inputData[i] );
}
void OuterFunction( Thingy inputData[], int N, int numCores )
{
   int perCore = N / numCores; // assuming N % numCores == 0
                               // (omitting the edge case for clarity)
   for ( int c = 0 ; c < numCores ; ++c )
     QueueJob( JobFunc, inputData, c * perCore, (c + 1) * perCore );
}

So long as your input data is completely independent, as you say in your original question, you needn't lock it; synchronization is only necessary when there is dependency between the threads and here there is none.

Also, at this level of performance microoptimizations start to become relevant: most importantly, cache locality. Prefetching can get you a surprisingly long way.

Then consider the possibility of SIMD: you can vectorize the function to run four input points through a single register simultaneously. With four cores and 4-wide SIMD you can theoretically get a 16x speedup, but this assumes that the work InnerFunction does is mostly straight-line math, since branching tends to obliterate SSE/VMX performance gains.

Crashworks answered Nov 14 '22


What a fun question... as you've noted, you won't be able to afford the overhead of traditional locking on a work queue at this granularity. I'd encourage you to try one of the existing fine-grained, task-based programming environments if you can. I think about this as three buckets of work:

The first chunk of the problem is to ensure safety, correctness and parallelizability, and it sounds like you have that covered because your function is pure.

I think the next most challenging portion is describing the concurrency; specifically, you mention this function is called many, many times. Can you pipeline this and separate scheduling the function from its work? If you can't pipeline it, does it look like a parallel loop, a tree traversal, or something more unstructured? Specifically, obeying Amdahl's law, if you can't overlap the work and ensure that several instances of it (or something else) are running at the same time, you're effectively serial even though the function is pure. Anything you can do to refactor the work into a pipeline, a recursive tree traversal (or parallel loop), or, if you must, more unstructured work with explicit dependencies between tasks will help here, regardless of the library used.

The last area I think about is ensuring efficient execution on your platform. This involves reducing overhead and contention in both your code and the scheduling code, and making any serial code absolutely as efficient as possible. If you can't use one of the existing libraries and must build your own, I'd encourage you to look at work-stealing queues and self-guided scheduling algorithms. As you've noted, you won't see gains from traditional locks because their cost outweighs your function's cost, so you'll most likely need lock-free techniques to reduce the cost of scheduling a task onto (and removing it from) whatever queue you use. You'll also need to pay close attention to sharing and contention, both within your scheduling algorithm and within your function: at this level of granularity, in addition to the usual branch-misprediction and instruction-throughput issues, shared state can be a source of contention even on reads.

I'm sorry if this wasn't super specific, but I hope it was useful.

Rick answered Nov 14 '22