Why does Haskell use mergesort instead of quicksort?

In Wikibooks' Haskell, there is the following claim:

Data.List offers a sort function for sorting lists. It does not use quicksort; rather, it uses an efficient implementation of an algorithm called mergesort.

What is the underlying reason for Haskell to use mergesort over quicksort? Quicksort usually has better practical performance, but maybe not in this case. I gather that the in-place benefits of quicksort are hard (impossible?) to achieve with Haskell lists.

There was a related question on softwareengineering.SE, but it wasn't really about why mergesort is used.

I implemented the two sorts myself for profiling. Mergesort was superior (around twice as fast for a list of 2^20 elements), but I'm not sure that my implementation of quicksort was optimal.

Edit: Here are my implementations of mergesort and quicksort:

    mergesort :: Ord a => [a] -> [a]
    mergesort [] = []
    mergesort [x] = [x]
    mergesort l = merge (mergesort left) (mergesort right)
        where size = div (length l) 2
              (left, right) = splitAt size l

    merge :: Ord a => [a] -> [a] -> [a]
    merge ls [] = ls
    merge [] vs = vs
    merge first@(l:ls) second@(v:vs)
        | l < v = l : merge ls second
        | otherwise = v : merge first vs

    quicksort :: Ord a => [a] -> [a]
    quicksort [] = []
    quicksort [x] = [x]
    quicksort l = quicksort less ++ pivot:(quicksort greater)
        where pivotIndex = div (length l) 2
              pivot = l !! pivotIndex
              [less, greater] = foldl addElem [[], []] $ enumerate l
              addElem [less, greater] (index, elem)
                | index == pivotIndex = [less, greater]
                | elem < pivot = [elem:less, greater]
                | otherwise = [less, elem:greater]

    enumerate :: [a] -> [(Int, a)]
    enumerate = zip [0..]

Edits 2 & 3: I was asked to provide timings for my implementations versus the sort in Data.List. Following @Will Ness's suggestions, I compiled this gist with the -O2 flag, changing the supplied sort in main each time, and executed it with +RTS -s. The sorted list was a cheaply created, pseudorandom [Int] list with 2^20 elements (see the sketch after the timings below). The results were as follows:

  • Data.List.sort: 0.171s
  • mergesort: 1.092s (~6x slower than Data.List.sort)
  • quicksort: 1.152s (~7x slower than Data.List.sort)
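
For reference, a minimal self-contained harness along these lines could look like the following. This is not the linked gist; the LCG constants and the choice to force the result by printing its last element are illustrative choices.

    -- Benchmark sketch: a cheap linear congruential generator supplies
    -- 2^20 pseudorandom Ints, and printing the last element of the sorted
    -- list forces the whole sort to run.
    module Main where

    import Data.List (sort)   -- swap in mergesort or quicksort to compare

    -- Knuth's MMIX LCG constants; Int overflow stands in for the modulus.
    step :: Int -> Int
    step x = 6364136223846793005 * x + 1442695040888963407

    pseudoRandoms :: Int -> [Int]
    pseudoRandoms seed = tail (iterate step seed)

    main :: IO ()
    main = print (last (sort (take (2 ^ 20) (pseudoRandoms 42))))

Compile with ghc -O2 and run the resulting binary with +RTS -s to get the timing and allocation summary for whichever sort is plugged in.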
asked Sep 08 '18 by Robert D-B



2 Answers

In imperative languages, Quicksort is performed in-place by mutating an array. As you demonstrate in your code sample, you can adapt Quicksort to a pure functional language like Haskell by building singly-linked lists instead, but this is not as fast.

On the other hand, Mergesort is not an in-place algorithm: a straightforward imperative implementation copies the merged data to a different allocation. This is a better fit for Haskell, which by its nature must copy the data anyway.
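
For concreteness, the list-based adaptation usually looks something like the sketch below (this is just the textbook idiom with the head as the pivot, not code from Data.List): every partition allocates fresh cons cells instead of swapping elements within an array.

    -- Quicksort adapted to immutable lists: each partition builds new
    -- lists, so the in-place advantage of the array version is lost.
    qsortList :: Ord a => [a] -> [a]
    qsortList []     = []
    qsortList (x:xs) =
        qsortList [a | a <- xs, a < x] ++ [x] ++ qsortList [a | a <- xs, a >= x]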

Let's step back a bit: Quicksort's performance edge is "lore" -- a reputation built up decades ago on machines much different from the ones we use today. Even if you use the same language, this kind of lore needs rechecking from time to time, as the facts on the ground can change. The last benchmarking paper I read on this topic had Quicksort still on top, but its lead over Mergesort was slim, even in C/C++.

Mergesort has other advantages: it guarantees O(n log n) in the worst case, without the pivot-selection tweaks Quicksort needs to avoid its O(n^2) degenerate cases, and it is naturally stable. So if the narrow performance edge is lost to other factors anyway, Mergesort is an obvious choice.
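
As a side note on stability: in a list merge it comes down to the tie-breaking direction, i.e. taking from the left (earlier) run when two elements compare equal. A minimal sketch (note the <=; the merge in the question uses < and so prefers the right list on ties, which is not stable):

    -- Stable merge: on ties the element from the left run is emitted
    -- first, so equal elements keep their original relative order.
    mergeStable :: Ord a => [a] -> [a] -> [a]
    mergeStable xs [] = xs
    mergeStable [] ys = ys
    mergeStable xxs@(x:xs) yys@(y:ys)
        | x <= y    = x : mergeStable xs yys
        | otherwise = y : mergeStable xxs ys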

answered by comingstorm


I think @comingstorm's answer is pretty much on the nose, but here's some more info on the history of GHC's sort function.

In the source code for Data.OldList, you can find the implementation of sort and verify for yourself that it's a merge sort. Just below the definition in that file is the following comment:

    Quicksort replaced by mergesort, 14/5/2002.

    From: Ian Lynagh <[email protected]>

    I am curious as to why the List.sort implementation in GHC is a
    quicksort algorithm rather than an algorithm that guarantees n log n
    time in the worst case? I have attached a mergesort implementation
    along with a few scripts to time it's performance...

So, originally a functional quicksort was used (and the function qsort is still there, but commented out). Ian's benchmarks showed that his mergesort was competitive with quicksort in the "random list" case and massively outperformed it in the case of already sorted data. Later, Ian's version was replaced by another implementation that was about twice as fast, according to additional comments in that file.
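
To give a feel for the shape of such an implementation, here is a simplified bottom-up merge sort. It is only a sketch of the general strategy, not the actual Data.OldList code, which starts from maximal ascending and descending runs found in the input rather than from singleton runs.

    -- Bottom-up merge sort: start from one-element runs, then repeatedly
    -- merge adjacent pairs of runs until a single sorted run remains.
    msortBottomUp :: Ord a => [a] -> [a]
    msortBottomUp = mergeAll . map (: [])
      where
        mergeAll []    = []
        mergeAll [run] = run
        mergeAll runs  = mergeAll (mergePairs runs)

        mergePairs (r1:r2:rest) = merge2 r1 r2 : mergePairs rest
        mergePairs runs         = runs

        merge2 xs [] = xs
        merge2 [] ys = ys
        merge2 xxs@(x:xs) yys@(y:ys)
            | x <= y    = x : merge2 xs yys
            | otherwise = y : merge2 xxs ys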

The main issue with the original qsort was that it didn't use a random pivot. Instead it pivoted on the first value in the list. This is obviously pretty bad, because it means performance will be at or near its worst case for sorted (or nearly sorted) input. Unfortunately, there are a couple of challenges in switching from "pivot on first" to an alternative (either random, or -- as in your implementation -- somewhere in "the middle"). In a functional language without side effects, managing a source of pseudorandom numbers is a bit of a problem, but let's say you solve that (maybe by threading a random number generator through your sort function). You still have the problem that, when sorting an immutable linked list, locating an arbitrary pivot and then partitioning based on it will involve multiple list traversals and sublist copies.
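
As an illustration of those costs (a hypothetical sketch, not anything from GHC; System.Random is used only to show a generator being threaded through), a random-pivot quicksort on lists ends up traversing and copying the list several times at every level of recursion:

    import System.Random (RandomGen, mkStdGen, randomR, split)

    -- Random-pivot quicksort on lists: note the extra passes needed just
    -- to pick and remove the pivot, plus two more to partition.
    qsortRandom :: (RandomGen g, Ord a) => g -> [a] -> [a]
    qsortRandom _ [] = []
    qsortRandom g xs =
        qsortRandom gL smaller ++ [pivot] ++ qsortRandom gR larger
      where
        (i, g')  = randomR (0, length xs - 1) g  -- one traversal for length
        (gL, gR) = split g'                      -- fresh generators for each branch
        pivot    = xs !! i                       -- another partial traversal
        rest     = take i xs ++ drop (i + 1) xs  -- copy to remove the pivot
        smaller  = [a | a <- rest, a < pivot]    -- partition pass
        larger   = [a | a <- rest, a >= pivot]   -- and another

Something like qsortRandom (mkStdGen 42) would then be the entry point, with the seed chosen by the caller.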

I think the only way to realize the supposed benefits of quicksort would be to write the list out to a vector, sort it in place (sacrificing sort stability), and write it back out to a list. I don't see how that could ever be an overall win. On the other hand, if you already have your data in a vector, then an in-place quicksort would definitely be a reasonable option.
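
A sketch of that round trip, assuming the vector and vector-algorithms packages (Intro.sort is an in-place introsort, i.e. quicksort with a heapsort fallback; the wrapper name here is made up):

    import qualified Data.Vector as V
    import qualified Data.Vector.Algorithms.Intro as Intro  -- vector-algorithms

    -- Copy the list into a vector, sort that in place inside V.modify,
    -- then convert back. Stability is lost, and the conversions themselves
    -- cost time and allocation, eating into the in-place advantage.
    sortViaVector :: Ord a => [a] -> [a]
    sortViaVector = V.toList . V.modify Intro.sort . V.fromList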

answered by K. A. Buhr