I've been banging my head against this one for hours, and I always end up with thread contention eating up any performance improvement from parallelizing my loop.
I'm trying to calculate a histogram of an 8-bit grayscale gigapixel image. People who have read the book "CUDA by Example" will probably know where this is coming from (Chapter 9).
The method is very very simple (resulting in a very tight loop). It's basically just
private static void CalculateHistogram(uint[] histo, byte[] buffer)
{
    foreach (byte thisByte in buffer)
    {
        // increment the histogram at the position
        // of the current array value
        histo[thisByte]++;
    }
}
where buffer is an array of 1024^3 elements.
On a somewhat recent Sandy Bridge-EX CPU, building a histogram of 1 billion elements takes about 1 second running on a single core.
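For reference, a minimal harness along these lines reproduces that measurement (the random test data here is just a stand-in for the actual image):

using System;
using System.Diagnostics;

byte[] buffer = new byte[1024 * 1024 * 1024];   // 1024^3 elements
new Random(42).NextBytes(buffer);               // stand-in for real image data

var histo = new uint[256];
var sw = Stopwatch.StartNew();
CalculateHistogram(histo, buffer);
sw.Stop();
Console.WriteLine("sequential: {0} ms", sw.ElapsedMilliseconds);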
Anyway, I tried speeding up the calculation by distributing the loop among all my cores, and ended up with a solution that's 50 times slower.
private static void CalculateHistogramParallel(byte[] buffer, ref int[] histo)
{
    // copy the array reference to a local variable,
    // since ref parameters can't be captured by lambdas
    int[] histocopy = histo;

    var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

    // loop through the buffer array in parallel
    Parallel.ForEach(
        buffer,
        parallelOptions,
        thisByte => Interlocked.Increment(ref histocopy[thisByte]));
}
Quite obviously, that's because of the performance impact of the atomic increment.
No matter what I tried (like range partitioners [http://msdn.microsoft.com/en-us/library/ff963547.aspx], concurrent collections [http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx], and so on), it boils down to the fact that I'm reducing one billion elements into 256 buckets, so all my threads end up contending for the same few entries of the histogram array.
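The concurrent-collection attempt, for instance, looked roughly like this (a sketch from memory; the dictionary stands in for the plain histogram array):

using System.Collections.Concurrent;
using System.Threading.Tasks;

var dict = new ConcurrentDictionary<byte, int>();
Parallel.ForEach(buffer, thisByte =>
    dict.AddOrUpdate(thisByte, 1, (key, count) => count + 1));

Every increment still funnels through the same 256 hot keys, so it's no faster than the Interlocked version.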
My last try was to use a range partitioner like
var rangePartitioner = Partitioner.Create(0, buffer.Length);

Parallel.ForEach(rangePartitioner, parallelOptions, range =>
{
    var temp = new int[256];
    for (long i = range.Item1; i < range.Item2; i++)
    {
        temp[buffer[i]]++;
    }
});
to calculate sub-histograms. But in the end, I'm still having the problem that I have to merge all those sub-histograms, and bang, thread contention again.
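The merge I mean would sit at the end of the Parallel.ForEach body above, something like this (sketched here with Interlocked.Add; my actual merge code isn't shown):

    // merging the sub-histogram back into the shared array
    // brings back synchronized access, once per bucket per partition
    for (int j = 0; j < temp.Length; j++)
    {
        Interlocked.Add(ref histocopy[j], temp[j]);
    }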
I refuse to believe that there is no way to speed things up by parallelizing, even if it's such a tight loop. If it's possible on the GPU, it must be possible, to some extent, on the CPU as well.
What else, except giving up, is there to try?
I've searched Stack Overflow and the interwebs quite a bit, but this seems to be an edge case for parallelism.
You should use one of the Parallel.ForEach overloads that takes thread-local state.

Each partition of a parallelized loop has its own local state, which means it doesn't need synchronization. As a final action, you aggregate every local state into the final value. This step requires synchronization, but it's performed only once per partition instead of once per iteration.
Instead of
Parallel.ForEach(
    buffer,
    parallelOptions,
    thisByte => Interlocked.Increment(ref histocopy[thisByte]));
you can use
Parallel.ForEach(
    buffer,
    parallelOptions,
    () => new int[histocopy.Length],   // initialize the local histogram
    (thisByte, state, local) =>
    {
        local[thisByte]++;             // increment the local histogram, no locking needed
        return local;
    },
    local =>
    {
        lock (histocopy)               // add the local histogram to the global one
        {
            for (int idx = 0; idx < histocopy.Length; idx++)
            {
                histocopy[idx] += local[idx];
            }
        }
    });
It might also be a good idea to start with the defaults for partition size and parallel options and then optimize from there.
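If the per-element delegate call itself turns out to be the bottleneck, you could also combine the range partitioner from your question with this local-state overload, along these lines (a sketch, untested):

var rangePartitioner = Partitioner.Create(0, buffer.Length);

Parallel.ForEach(
    rangePartitioner,
    () => new int[histocopy.Length],        // one local histogram per task
    (range, state, local) =>
    {
        // tight sequential loop over this partition, no synchronization
        for (int i = range.Item1; i < range.Item2; i++)
        {
            local[buffer[i]]++;
        }
        return local;
    },
    local =>
    {
        lock (histocopy)                    // merge once per task
        {
            for (int idx = 0; idx < histocopy.Length; idx++)
            {
                histocopy[idx] += local[idx];
            }
        }
    });

That way the hot loop is a plain array increment, and the synchronized merge still runs only once per task.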