
Parallelizing a very tight loop

I've been banging my head for hours on this one, and I always end up with thread contention eating up any performance improvement from parallelizing my loop.

I'm trying to calculate a histogram of an 8-bit grayscale gigapixel image. People who have read the book "CUDA by Example" will probably know where this is coming from (Chapter 9).

The method is very very simple (resulting in a very tight loop). It's basically just

    private static void CalculateHistogram(uint[] histo, byte[] buffer) 
    {
        foreach (byte thisByte in buffer) 
        {
            // increment the histogram at the position
            // of the current array value
            histo[thisByte]++;
        }
    }

where buffer is an array of 1024^3 elements.

On a somewhat recent Sandy Bridge-EX CPU, building a histogram of 1 billion elements takes 1 second running on one core.
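
(For anyone who wants to reproduce the measurement, a minimal sketch of the timing looks like this; the random fill is illustrative only, the real buffer comes from the image.)

    // Minimal timing sketch: allocate a 1 GiB buffer and time the sequential pass.
    var buffer = new byte[1024 * 1024 * 1024];
    new Random(42).NextBytes(buffer);   // placeholder data; the real buffer comes from the image

    var histo = new uint[256];

    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    CalculateHistogram(histo, buffer);
    stopwatch.Stop();

    Console.WriteLine("Sequential pass: " + stopwatch.ElapsedMilliseconds + " ms");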

Anyways, I tried speeding up the calculation by distributing the loop among all my cores and ended up with a solution that is 50 times slower.

    private static void CalculateHistogramParallel(byte[] buffer, ref int[] histo) 
    {
        // create a variable holding a reference to the histogram array
        int[] histocopy = histo;

        var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

        // loop through the buffer array in parallel
        Parallel.ForEach(
            buffer,
            parallelOptions,
            thisByte => Interlocked.Increment(ref histocopy[thisByte]));
    }

Quite obviously, that is because of the performance impact of the atomic increment.

No matter what I tried (like range partitioners [http://msdn.microsoft.com/en-us/library/ff963547.aspx], concurrent collections [http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx], and so on), it boils down to the fact that I'm reducing one billion elements to 256 elements, and I always end up in a race condition while trying to access my histogram array.

My last try was to use a range partitioner like

    var rangePartitioner = Partitioner.Create(0, buffer.Length);

    Parallel.ForEach(rangePartitioner, parallelOptions, range =>
    {
        var temp = new int[256];
        for (long i = range.Item1; i < range.Item2; i++)
        {
            temp[buffer[i]]++;
        }
    });

to calculate sub-histograms. But in the end, I'm still having the problem that I have to merge all those sub-histograms, and bang, thread contention again.
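
For illustration, the merge step I left out of the snippet above would go at the end of each range's body and look roughly like this (histocopy being the shared histogram array, as in the earlier snippet); this is exactly where the synchronization comes back in:

    // hypothetical merge at the end of each range's body:
    // every partition still has to synchronize on the same shared 256-entry array
    for (int bin = 0; bin < temp.Length; bin++)
    {
        Interlocked.Add(ref histocopy[bin], temp[bin]);
    }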

I refuse to believe that there is no way to speed things up by parallelizing, even if it's such a tight loop. If it's possible on the GPU, it must be - to some extent - possible on the CPU as well.

What else, except giving up, is there to try?

I've searched Stack Overflow and the interwebs quite a bit, but this seems to be an edge case for parallelism.

asked Jul 18 '14 by lightxx



1 Answer

You should use one of the Parallel.ForEach overloads that takes a thread-local state.

Each separate partition of a parallelized loop gets its own local state, which means it doesn't need synchronization. As a final action you have to aggregate every local state into the final value. This step requires synchronization, but it is only performed once per partition instead of once per iteration.

Instead of

    Parallel.ForEach(
        buffer,
        parallelOptions,
        thisByte => Interlocked.Increment(ref histocopy[thisByte]));

you can use

    Parallel.ForEach(
        buffer,
        parallelOptions,
        () => new int[histocopy.Length],   // initialize a local histogram per partition
        (thisByte, state, local) =>
        {
            local[thisByte]++;             // increment the local histogram, no locking needed
            return local;
        },
        local =>
        {
            lock (histocopy)               // merge the local histogram into the shared one
            {
                for (int idx = 0; idx < histocopy.Length; idx++)
                {
                    histocopy[idx] += local[idx];
                }
            }
        });

It might also be a good idea to start with the default options for partition size and parallel options and optimize from there.
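
If iterating the byte array one element at a time through a delegate still turns out to be too slow, a variant worth sketching (an untested outline, not benchmarked here) is to combine the range partitioner from the question with the same local-state pattern, so the inner loop stays tight and synchronization still happens only once per partition:

    // Sketch: range partitioner + per-partition local histogram + one locked merge per partition.
    // Requires System.Collections.Concurrent and System.Threading.Tasks.
    private static int[] CalculateHistogramPartitioned(byte[] buffer)
    {
        var histogram = new int[256];

        Parallel.ForEach(
            Partitioner.Create(0, buffer.Length),
            () => new int[256],                      // local histogram per partition
            (range, state, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    local[buffer[i]]++;              // tight, lock-free inner loop
                }
                return local;
            },
            local =>
            {
                lock (histogram)                     // one merge per partition
                {
                    for (int bin = 0; bin < 256; bin++)
                    {
                        histogram[bin] += local[bin];
                    }
                }
            });

        return histogram;
    }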

answered Sep 18 '22 by Dirk