I'm faced with parallelizing an algorithm which, in its serial implementation, examines the six faces of a cube of array locations within a much larger three-dimensional array. (That is, select an array element, then define a cube or cuboid around that element 'n' elements distant in x, y, and z, clipped to the bounds of the array.)
Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):
do n1 = nlo, nhi
  do o1 = olo, ohi
    ! stop at the first location that satisfies the condition
    if (somecondition(n1, o1)) then
      retval = .TRUE.
      RETURN
    endif
  end do
end do
Or C pseudocode:
for (n1 = nlo; n1 <= nhi; n1++) {
    for (o1 = olo; o1 <= ohi; o1++) {
        /* stop at the first location that satisfies the condition */
        if (somecondition(n1, o1) != 0) {
            return true;
        }
    }
}
There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.
What I think would be best would be to schedule six or more threads of execution, round-robin if there aren't that many CPU cores, ideally with the loops executing in parallel, with the same goal as the serial algorithm: as soon as somecondition() becomes True, execution among all the threads must immediately stop and a value of True must be set in a shared location.
What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.
Are there message passing mechanisms in OpenMP?
EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?
Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I'm inferring this from your comment "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." Parallelizing over the array elements is also flexible in terms of the number of threads. Or is there a reason that the array elements have to be checked in some order?
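A minimal sketch of that outer-loop approach in C with OpenMP, under some assumptions: the array is stored as a flat buffer with x varying fastest, and the callback type `element_test` stands in for your serial six-face algorithm. The names `test_all_elements` and `demo_x_neighbour_positive` are hypothetical, made up for illustration. Each thread writes only to its own result slots, so no signaling between threads is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: the serial per-element test from the question. */
typedef bool (*element_test)(const float *a, int nx, int ny, int nz,
                             int x, int y, int z);

/* Runs serial_test on every element; only the outermost loop is shared
   among threads. result must have room for nx*ny*nz entries. */
void test_all_elements(const float *a, int nx, int ny, int nz,
                       element_test serial_test, unsigned char *result)
{
    assert(a != NULL && result != NULL && serial_test != NULL);

    #pragma omp parallel for schedule(static)
    for (int z = 0; z < nz; ++z) {
        for (int y = 0; y < ny; ++y) {
            for (int x = 0; x < nx; ++x) {
                size_t idx = ((size_t)z * ny + y) * nx + x;
                /* Distinct idx per iteration, so no locking is required. */
                result[idx] = serial_test(a, nx, ny, nz, x, y, z) ? 1 : 0;
            }
        }
    }
}

/* Toy stand-in for the real test (an assumption, not the asker's
   algorithm): true if either x-neighbour, clamped to the array
   bounds, holds a positive value. */
static bool demo_x_neighbour_positive(const float *a, int nx, int ny, int nz,
                                      int x, int y, int z)
{
    (void)nz;
    int xlo = x > 0 ? x - 1 : 0;
    int xhi = x < nx - 1 ? x + 1 : nx - 1;
    size_t row = ((size_t)z * ny + y) * nx;
    return a[row + xlo] > 0.0f || a[row + xhi] > 0.0f;
}
```

Compiled without OpenMP support the pragma is ignored and the same code runs serially, which makes it easy to compare timings before committing to the parallel version.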
It seems that the portion you are showing us doesn't take long to execute, so reducing its wall-clock time by making it parallel might not be easy. There is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.
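If you do want to parallelize the six work units themselves, one common OpenMP pattern for the early exit is a shared flag that every thread polls, since a thread cannot `break` out of the loop attached to a `parallel for`. The sketch below assumes a hypothetical `face_range` struct and `face_predicate` callback standing in for `somecondition()`; `any_face_hit` and the two demo predicates are likewise invented names. It also assumes a compiler with OpenMP 3.1 or later for `atomic read` and `atomic write`; a compiler limited to OpenMP 2.0 would need a different way to share the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one face: inclusive index ranges. */
typedef struct {
    int nlo, nhi, olo, ohi;
} face_range;

/* Hypothetical stand-in for somecondition(), with the face index added. */
typedef bool (*face_predicate)(int face, int n1, int o1);

/* Returns true if pred holds at any location on any face. */
bool any_face_hit(const face_range *faces, int nfaces, face_predicate pred)
{
    assert(faces != NULL && pred != NULL);
    int found = 0; /* shared flag: set once by whichever thread hits first */

    #pragma omp parallel for schedule(dynamic) shared(found)
    for (int f = 0; f < nfaces; ++f) {
        for (int n1 = faces[f].nlo; n1 <= faces[f].nhi; ++n1) {
            int stop;
            #pragma omp atomic read
            stop = found;
            if (stop) break; /* another thread already found a hit */

            for (int o1 = faces[f].olo; o1 <= faces[f].ohi; ++o1) {
                if (pred(f, n1, o1)) {
                    #pragma omp atomic write
                    found = 1;
                    break; /* leaves the o1 loop; the n1 loop exits on its next poll */
                }
            }
        }
    }
    return found != 0;
}

/* Demo predicates (assumptions for illustration only). */
static bool demo_never(int face, int n1, int o1)
{
    (void)face; (void)n1; (void)o1;
    return false;
}

static bool demo_hit(int face, int n1, int o1)
{
    return face == 3 && n1 == 5 && o1 == 7;
}
```

The flag is polled once per row rather than once per element, so the stop is prompt rather than truly immediate; that trades a little wasted work for less contention on the shared variable. OpenMP 4.0 also defines a `cancel` construct for this purpose, which is not used here.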