
(When) are parallel sorts practical and how do you write an efficient one?

I'm working on a parallelization library for the D programming language. Now that I'm pretty happy with the basic primitives (parallel foreach, map, reduce and tasks/futures), I'm starting to think about some higher level parallel algorithms. Among the more obvious candidates for parallelization is sorting.

My first question is, are parallelized versions of sorting algorithms useful in the real world, or are they mostly academic? If they are useful, where are they useful? I personally would seldom use them in my work, simply because I usually peg all of my cores at 100% using a much coarser grained level of parallelism than a single sort() call.

Secondly, it seems like quick sort is almost embarrassingly parallel for large arrays, yet I can't get the near-linear speedups I believe I should be getting. For a quick sort, the only inherently serial part is the first partition. I tried parallelizing a quick sort by, after each partition, sorting the two subarrays in parallel. In simplified pseudocode:

// I tweaked this number a bunch.  Anything smaller than this and the
// overhead outweighs the parallelization gains.
const smallestToParallelize = 500;

// Cutoff below which insertion sort beats quick sort (value illustrative).
const insertionSortThreshold = 32;

void quickSort(T)(T[] array) {
    if(array.length < insertionSortThreshold) {
        insertionSort(array);
        return;
    }

    size_t pivotPosition = partition(array);

    if(array.length >= smallestToParallelize) {
        // Sort the left subarray in a task pool thread while this
        // thread sorts the right subarray.
        auto myTask = taskPool.execute(quickSort(array[0..pivotPosition]));
        quickSort(array[pivotPosition + 1..$]);
        myTask.workWait();
    } else {
        // Regular serial quick sort.
        quickSort(array[0..pivotPosition]);
        quickSort(array[pivotPosition + 1..$]);
    }
}

Even for very large arrays, where the time the first partition takes is negligible, I can only get about a 30% speedup on a dual core, compared to a purely serial version of the algorithm. I'm guessing the bottleneck is shared memory access. Any insight on how to eliminate this bottleneck or what else the bottleneck might be?

Edit: My task pool has a fixed number of threads, equal to the number of cores in the system minus 1 (since the main thread also does work). Also, the type of wait I'm using is a work wait, i.e. if the task is started but not finished, the thread calling workWait() steals other jobs off the pool and does them until the one it's waiting on is done. If the task isn't started, it is completed in the current thread. This means that the waiting isn't inefficient. As long as there is work to be done, all threads will be kept busy.

asked Feb 13 '10 by dsimcha



2 Answers

Keep in mind I'm not an expert on parallel sorting (people make research careers out of it), but...

1) Are they useful in the real world?

Of course they are, if you need to sort something expensive (like strings or worse) and you aren't pegging all the cores.

  • Think UI code where you need to sort a large dynamic list of strings based on context.
  • Think something like a Barnes-Hut n-body sim where you need to sort the particles.

2) Quicksort seems like it would give a linear speedup, but it doesn't. The partition step is a sequential bottleneck; you will see this if you profile, and it will tend to cap out at 2-3x on a quad core.
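To make that cap concrete, here is the standard work/span estimate (a general argument, not specific to the question's code): serial quicksort does roughly T1(n) = Θ(n log n) total work, but with a serial partition the critical path is T∞(n) = n + T∞(n/2) = Θ(n), because the top-level partition alone must touch all n elements before any parallelism begins. The best possible speedup is therefore T1/T∞ = Θ(log n), no matter how many cores you add.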

If you want to get good speedups on a smaller system, you need to ensure that your per-task overheads are really small, and ideally you want to make sure you don't have too many threads running, i.e. not much more than 2 on a dual core. A thread pool probably isn't the right abstraction.

If you want to get good speedups on a larger system, you'll need to look at scan-based parallel sorts; there are papers on this. Bitonic sort is also quite easy to parallelize, as is merge sort. A parallel radix sort can also be useful; there is one in the PPL (if you aren't averse to Visual Studio 11).
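Merge sort shows the pattern well because the recursive split parallelizes cleanly. Here is a minimal sketch in D using std.parallelism's task primitives (the module that descends from the library the question describes); the cutoff value is illustrative, and the merge step is left serial, which is exactly the bottleneck the scan-based approaches attack:

import std.parallelism : task, taskPool;

// Illustrative cutoff; below this, task overhead outweighs the gain.
enum size_t parallelCutoff = 10_000;

// Serial, stable merge of the two sorted halves of a (split at mid)
// into buf, then copy back.  buf must be the same length as a.
void mergeHalves(T)(T[] a, size_t mid, T[] buf) {
    size_t i = 0, j = mid, k = 0;
    while(i < mid && j < a.length)
        buf[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while(i < mid)      buf[k++] = a[i++];
    while(j < a.length) buf[k++] = a[j++];
    a[] = buf[];
}

void parallelMergeSort(T)(T[] a, T[] buf) {
    if(a.length < 2) return;
    immutable mid = a.length / 2;

    if(a.length >= parallelCutoff) {
        // Sort the left half in a pool thread, the right half here.
        auto left = task!(parallelMergeSort!T)(a[0 .. mid], buf[0 .. mid]);
        taskPool.put(left);
        parallelMergeSort(a[mid .. $], buf[mid .. $]);
        left.workForce();  // steal other pool work while waiting
    } else {
        parallelMergeSort(a[0 .. mid], buf[0 .. mid]);
        parallelMergeSort(a[mid .. $], buf[mid .. $]);
    }

    mergeHalves(a, mid, buf);
}

A caller passes a scratch buffer of the same length, e.g. parallelMergeSort(data, new int[data.length]) for an int[].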

answered Oct 27 '22 by Rick


I'm no expert but... here is what I'd look at:

First of all, I've heard that as a rule of thumb, algorithms that look at small bits of a problem from the start tend to work better as parallel algorithms.

Looking at your implementation, try making the parallel/serial switch go the other way: partition the array and sort in parallel until you have N segments, then go serial (see the sketch after the tweak list below). If you are more or less grabbing a new thread for each parallel case, then N should be ~ your core count. OTOH, if your thread pool is of fixed size and acts as a queue of short-lived delegates, then I'd use N ~ 2+ times your core count (so that cores don't sit idle because one partition finished faster).

Other tweaks:

  • Skip the myTask.workWait() at the local level and instead have a wrapper function that waits on all the tasks.
  • Make a separate serial implementation of the function that avoids the depth check.
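Putting the segment idea and the tweaks together, here is a rough sketch in the spirit of the question's pseudocode. The std.parallelism calls, the Lomuto partition, and the 2x-cores segment count are my assumptions, and for brevity it still waits at each level rather than collecting every task into one wrapper:

import std.algorithm.mutation : swap;
import std.algorithm.sorting : sort;   // serial fallback, no depth checks
import std.parallelism : task, taskPool, totalCPUs;

// Lomuto partition around the last element; returns the pivot's
// final index.  Kept deliberately simple for illustration.
size_t partition(T)(T[] a) {
    auto pivot = a[$ - 1];
    size_t i = 0;
    foreach(j; 0 .. a.length - 1)
        if(a[j] < pivot) { swap(a[i], a[j]); ++i; }
    swap(a[i], a[$ - 1]);
    return i;
}

// Partition in parallel until ~segments pieces exist, then go serial.
void quickSortDepth(T)(T[] a, size_t segments) {
    if(a.length < 2) return;

    if(segments <= 1) {
        sort(a);   // purely serial from here down
        return;
    }

    immutable p = partition(a);

    // Spawn the left piece, recurse on the right, then work-wait.
    auto left = task!(quickSortDepth!T)(a[0 .. p], segments / 2);
    taskPool.put(left);
    quickSortDepth(a[p + 1 .. $], segments - segments / 2);
    left.workForce();
}

// Entry point: aim for ~2x as many segments as cores so none sit idle.
void parallelQuickSort(T)(T[] a) {
    quickSortDepth(a, 2 * totalCPUs);
}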
answered Oct 27 '22 by BCS