I've been banging my head against this one for hours, and I always end up with thread contention eating up any performance improvement from parallelizing my loop.
I'm trying to calculate a histogram of an 8-bit grayscale gigapixel image. People who have read the book "CUDA by Example" will probably know where this is coming from (Chapter 9).
The method is very very simple (resulting in a very tight loop). It's basically just
private static void CalculateHistogram(uint[] histo, byte[] buffer)
{
    foreach (byte thisByte in buffer)
    {
        // increment the histogram at the position
        // of the current array value
        histo[thisByte]++;
    }
}
where buffer is an array of 1024^3 elements.
On a somewhat recent Sandy Bridge-EX CPU, building a histogram of 1 billion elements takes about 1 second running on a single core.
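For reference, a minimal harness along these lines reproduces that measurement (the random test data here is just a stand-in for the actual image):

using System;
using System.Diagnostics;

byte[] buffer = new byte[1024 * 1024 * 1024];   // 1024^3 elements
new Random(42).NextBytes(buffer);               // stand-in for real image data

var histo = new uint[256];
var sw = Stopwatch.StartNew();
CalculateHistogram(histo, buffer);
sw.Stop();
Console.WriteLine("sequential: {0} ms", sw.ElapsedMilliseconds);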
Anyway, I tried speeding up the calculation by distributing the loop among all my cores, and ended up with a solution that's 50 times slower.
private static void CalculateHistogramParallel(byte[] buffer, ref int[] histo)
{
    // copy the array reference to a local variable,
    // since ref parameters can't be captured by lambdas
    int[] histocopy = histo;

    var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

    // loop through the buffer array in parallel
    Parallel.ForEach(
        buffer,
        parallelOptions,
        thisByte => Interlocked.Increment(ref histocopy[thisByte]));
}
Quite obviously, that's because of the performance impact of the atomic increment.
No matter what I tried (like range partitioners [http://msdn.microsoft.com/en-us/library/ff963547.aspx], concurrent collections [http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx], and so on), it boils down to the fact that I'm reducing one billion elements into 256 buckets, so all my threads end up contending for the same few entries of the histogram array.
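The concurrent-collection attempt, for instance, looked roughly like this (a sketch from memory; the dictionary stands in for the plain histogram array):

using System.Collections.Concurrent;
using System.Threading.Tasks;

var dict = new ConcurrentDictionary<byte, int>();
Parallel.ForEach(buffer, thisByte =>
    dict.AddOrUpdate(thisByte, 1, (key, count) => count + 1));

Every increment still funnels through the same 256 hot keys, so it's no faster than the Interlocked version.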
My last try was to use a range partitioner like
var rangePartitioner = Partitioner.Create(0, buffer.Length);

Parallel.ForEach(rangePartitioner, parallelOptions, range =>
{
    var temp = new int[256];
    for (long i = range.Item1; i < range.Item2; i++)
    {
        temp[buffer[i]]++;
    }
});
to calculate sub-histograms. But in the end, I'm still having the problem that I have to merge all those sub-histograms, and bang, thread contention again.
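The merge I mean would sit at the end of the Parallel.ForEach body above, something like this (sketched here with Interlocked.Add; my actual merge code isn't shown):

    // merging the sub-histogram back into the shared array
    // brings back synchronized access, once per bucket per partition
    for (int j = 0; j < temp.Length; j++)
    {
        Interlocked.Add(ref histocopy[j], temp[j]);
    }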
I refuse to believe that there is no way to speed things up by parallelizing, even if it's such a tight loop. If it's possible on the GPU, it must be possible, to some extent, on the CPU as well.
What else, except giving up, is there to try?
I've searched Stack Overflow and the interwebs quite a bit, but this seems to be an edge case for parallelism.
You should use one of the Parallel.ForEach overloads that takes thread-local state.

Each partition of a parallelized loop has its own local state, which means it doesn't need synchronization. As a final action, you aggregate every local state into the final value. This step requires synchronization, but it's performed only once per partition instead of once per iteration.
Instead of
Parallel.ForEach(
    buffer,
    parallelOptions,
    thisByte => Interlocked.Increment(ref histocopy[thisByte]));
you can use
Parallel.ForEach(
    buffer,
    parallelOptions,
    () => new int[histocopy.Length],   // initialize the local histogram
    (thisByte, state, local) =>
    {
        local[thisByte]++;             // increment the local histogram, no locking needed
        return local;
    },
    local =>
    {
        lock (histocopy)               // add the local histogram to the global one
        {
            for (int idx = 0; idx < histocopy.Length; idx++)
            {
                histocopy[idx] += local[idx];
            }
        }
    });
It might also be a good idea to start with the defaults for partition size and parallel options and then optimize from there.
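If the per-element delegate call itself turns out to be the bottleneck, you could also combine the range partitioner from your question with this local-state overload, along these lines (a sketch, untested):

var rangePartitioner = Partitioner.Create(0, buffer.Length);

Parallel.ForEach(
    rangePartitioner,
    () => new int[histocopy.Length],        // one local histogram per task
    (range, state, local) =>
    {
        // tight sequential loop over this partition, no synchronization
        for (int i = range.Item1; i < range.Item2; i++)
        {
            local[buffer[i]]++;
        }
        return local;
    },
    local =>
    {
        lock (histocopy)                    // merge once per task
        {
            for (int idx = 0; idx < histocopy.Length; idx++)
            {
                histocopy[idx] += local[idx];
            }
        }
    });

That way the hot loop is a plain array increment, and the synchronized merge still runs only once per task.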