I'm very new to Haskell, and I have a question about what performance improvements can be had by using impure (mutable) data structures. I'm trying to piece together a few different things I've heard, so please bear with me if my terminology is not entirely correct, or if there are some small errors.
To make this concrete, consider the quicksort algorithm (taken from the Haskell wiki).
quicksort :: Ord a => [a] -> [a] quicksort [] = [] quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater) where lesser = filter (< p) xs greater = filter (>= p) xs
This is not "true quicksort." A "true" quicksort algorithm is in-place, and this is not. This is very memory inefficient.
On the other hand, it is possible to use vectors in Haskell to implement an in-place quicksort. An example is given in this stackoverflow answer.
How much faster is the second algorithm than the first? Big O notation doesn't help here, because the performance improvement is going to be from using memory more efficiently, not having a better algorithm (right?). I tired to construct some test cases on my own, but I had difficult getting things running.
An ideal answer would give some idea of what makes the in-place Haskell algorithm faster theoretically, and an example comparison of running times on some test data set.
On the other hand, it is possible to use vectors in Haskell to implement an in-place quicksort.
How much faster is the second algorithm than the first?
That depends on the implementation, of course. As can be seen below, for not too short lists, a decent in-place sort on a mutable vector or array is much faster than sorting lists, even if the time for the transformation from and to lists is included (and that conversion makes up the bulk of the time).
However, the list algorithms produce incremental output, while the array/vector algorithms don't produce any result before they have completed, therefore sorting lists can still be preferable.
I don't know exactly what the linked mutable array/vector algorithms did wrong. But they did something quite wrong.
For the mutable vector code, it seems that it used boxed vectors, and it was polymorphic, both can have significant performance impact, though the polymorphism shouldn't matter if the functions are {-# INLINABLE #-}
.
For the IOUArray
code, well, it looks fun, but slow. It uses an IORef
, readArray
and writeArray
and has no obvious strictness. The abysmal times it takes aren't too surprising, then.
Using a more direct translation of the (monomorphic) C code using an STUArray
, with a wrapper to make it work on lists¹,
{-# LANGUAGE BangPatterns #-} module STUQuickSort (stuquick) where import Data.Array.Base (unsafeRead, unsafeWrite) import Data.Array.ST import Control.Monad.ST stuquick :: [Int] -> [Int] stuquick [] = [] stuquick xs = runST (do let !len = length xs arr <- newListArray (0,len-1) xs myqsort arr 0 (len-1) -- Can't use getElems for large arrays, that overflows the stack, wth? let pick acc i | i < 0 = return acc | otherwise = do !v <- unsafeRead arr i pick (v:acc) (i-1) pick [] (len-1)) myqsort :: STUArray s Int Int -> Int -> Int -> ST s () myqsort a lo hi | lo < hi = do let lscan p h i | i < h = do v <- unsafeRead a i if p < v then return i else lscan p h (i+1) | otherwise = return i rscan p l i | l < i = do v <- unsafeRead a i if v < p then return i else rscan p l (i-1) | otherwise = return i swap i j = do v <- unsafeRead a i unsafeRead a j >>= unsafeWrite a i unsafeWrite a j v sloop p l h | l < h = do l1 <- lscan p h l h1 <- rscan p l1 h if (l1 < h1) then (swap l1 h1 >> sloop p l1 h1) else return l1 | otherwise = return l piv <- unsafeRead a hi i <- sloop piv lo hi swap i hi myqsort a lo (i-1) myqsort a (i+1) hi | otherwise = return ()
and a wrapper around a good sort (Introsort, not quicksort) on unboxed vectors,
module VSort where import Data.Vector.Algorithms.Intro import qualified Data.Vector.Unboxed as U import Control.Monad.ST vsort :: [Int] -> [Int] vsort xs = runST (do v <- U.unsafeThaw $ U.fromList xs sort v s <- U.unsafeFreeze v return $ U.toList s)
I get times more in line with the expectations (Note: For these timings, the random list has been deepseq
ed before calling the sorting algorithm. Without that, the conversion to an STUArray
would be much slower, since it would first evaluate a long list of thunks to determine the length. The fromList
conversion of the vector package doesn't suffer from this problem. Moving the deepseq
to the conversion to STUArray
, the other sorting [and conversion, in the vector case] algorithms take a little less time, so the difference between vector-algorithms' introsort and the STUArray
quicksort becomes a little larger.):
list size: 200000 -O2 -fllvm -fllvm-O2 ──────── ──────── ──────── ──────── ──────── Data.List.sort 0.663501s 0.665482s 0.652461s 0.792005s Naive.quicksort 0.587091s 0.577796s 0.585754s 0.667573s STUArray.quicksort 1.58023s 0.142626s 1.597479s 0.156411s VSort.vsort 0.820639s 0.139967s 0.888566s 0.143918s
The times without optimisation are expectedly bad for the STUArray
. unsafeRead
and unsafeWrite
must be inlined to be fast. If not inlined, you get a dictionary lookup for each call. Thus for the large dataset, I omit the unoptimised ways:
list size: 3000000 -O2 -fllvm-O2 ──────── ──────── ──────── Data.List.sort 16.728576s 16.442377s Naive.quicksort 14.297534s 12.253071s STUArray.quicksort 2.307203s 2.200807s VSort.vsort 2.069749s 1.921943s
You can see that an inplace sort on a mutable unboxed array is much faster than a list-based sort if done correctly. Whether the difference between the The difference between the STUArray
sort and the sort on the unboxed mutable vector is due to the different algorithm or whether vectors are indeed faster here, I don't know. Since I've never observed vectors to be faster² than STUArray
s, I tend to believe the former.STUArray
quicksort and the introsort is in part due to the better conversion from and to lists that the vector package offers, in part due to the different algorithms.
At Louis Wasserman's suggestion, I have run a quick benchmark using the other sorting algorithms from the vector-algorithms package, using a not-too-large dataset. The results aren't surprising, the good general-purpose algorithms heapsort, introsort and mergesort all do well, times near the quicksort on the unboxed mutable array (but of course, the quicksort would degrade to quadratic behaviour on almost sorted input, while these are guaranteed O(n*log n) worst case). The special-purpose sorting algorithms AmericanFlag
and radix sort do badly, since the input doesn't fit well to their purpose (radix sort would do better on larger inputs with a larger range, as is, it does too many more passes than needed for the data). Insertion sort is by far the worst, due to its quadratic behaviour.
AmericanFlag: list size: 300000 -O2 -fllvm-O2 ──────── ──────── ──────── Data.List.sort 1.083845s 1.084699s Naive.quicksort 0.981276s 1.05532s STUArray.quicksort 0.218407s 0.215564s VSort.vsort 2.566838s 2.618817s Heap: list size: 300000 -O2 -fllvm-O2 ──────── ──────── ──────── Data.List.sort 1.084252s 1.07894s Naive.quicksort 0.915984s 0.887354s STUArray.quicksort 0.219786s 0.225748s VSort.vsort 0.213507s 0.20152s Insertion: list size: 300000 -O2 -fllvm-O2 ──────── ──────── ──────── Data.List.sort 1.168837s 1.066058s Naive.quicksort 1.081806s 0.879439s STUArray.quicksort 0.241958s 0.209631s VSort.vsort 36.21295s 27.564993s Intro: list size: 300000 -O2 -fllvm-O2 ──────── ──────── ──────── Data.List.sort 1.09189s 1.112415s Naive.quicksort 0.891161s 0.989799s STUArray.quicksort 0.236596s 0.227348s VSort.vsort 0.221742s 0.20815s Merge: list size: 300000 -O2 -fllvm-O2 ──────── ──────── ──────── Data.List.sort 1.087929s 1.074926s Naive.quicksort 0.875477s 1.019984s STUArray.quicksort 0.215551s 0.221301s VSort.vsort 0.236661s 0.230287s Radix: list size: 300000 -O2 -fllvm-O2 ──────── ──────── ──────── Data.List.sort 1.085658s 1.085726s Naive.quicksort 1.002067s 0.900985s STUArray.quicksort 0.217371s 0.228973s VSort.vsort 1.958216s 1.970619s
Conclusion: Unless you have a specific reason not to, using one of the good general-purpose sorting algorithms from vector-algorithms, with a wrapper to convert from and to lists if necessary, is the recommended way to sort large lists. (These algorithms also work well with boxed vectors, in my measurements approximately 50% slower than unboxed.) For short lists, the overhead of the conversion would be so large that it doesn't pay.
Now, at @applicative's suggestion, a look at the sorting times for vector-algorithms' introsort, a quicksort on unboxed vectors and an improved (shamelessly stealing the implementation of unstablePartition
) quicksort on STUArray
s.
The improved STUArray
quicksort:
{-# LANGUAGE BangPatterns #-} module NQuick (stuqsort) where import Data.Array.Base (unsafeRead, unsafeWrite, getNumElements) import Data.Array.ST import Control.Monad.ST import Control.Monad (when) stuqsort :: STUArray s Int Int -> ST s () stuqsort arr = do n <- getNumElements arr when (n > 1) (myqsort arr 0 (n-1)) myqsort :: STUArray s Int Int -> Int -> Int -> ST s () myqsort a lo hi = do p <- unsafeRead a hi j <- unstablePartition (< p) lo hi a h <- unsafeRead a j unsafeWrite a j p unsafeWrite a hi h when (j > lo+1) (myqsort a lo (j-1)) when (j+1 < hi) (myqsort a (j+1) hi) unstablePartition :: (Int -> Bool) -> Int -> Int -> STUArray s Int Int -> ST s Int {-# INLINE unstablePartition #-} unstablePartition f !lf !rg !v = from_left lf rg where from_left i j | i == j = return i | otherwise = do x <- unsafeRead v i if f x then from_left (i+1) j else from_right i (j-1) from_right i j | i == j = return i | otherwise = do x <- unsafeRead v j if f x then do y <- unsafeRead v i unsafeWrite v i x unsafeWrite v j y from_left (i+1) j else from_right i (j-1)
The vector quicksort:
module VectorQuick (vquicksort) where import qualified Data.Vector.Unboxed.Mutable as UM import qualified Data.Vector.Generic.Mutable as GM import Control.Monad.ST import Control.Monad (when) vquicksort :: UM.STVector s Int -> ST s () vquicksort uv = do let li = UM.length uv - 1 ui = UM.unsafeSlice 0 li uv p <- UM.unsafeRead uv li j <- GM.unstablePartition (< p) ui h <- UM.unsafeRead uv j UM.unsafeWrite uv j p UM.unsafeWrite uv li h when (j > 1) (vquicksort (UM.unsafeSlice 0 j uv)) when (j + 1 < li) (vquicksort (UM.unsafeSlice (j+1) (li-j) uv))
The timing code:
{-# LANGUAGE BangPatterns #-} module Main (main) where import System.Environment (getArgs) import System.CPUTime import System.Random import Text.Printf import Data.Array.Unboxed import Data.Array.ST hiding (unsafeThaw) import Data.Array.Unsafe (unsafeThaw) import Data.Array.Base (unsafeAt, unsafeNewArray_, unsafeWrite) import Control.Monad.ST import Control.Monad import qualified Data.Vector.Unboxed as U import qualified Data.Vector.Unboxed.Mutable as UM import NQuick import VectorQuick import qualified Data.Vector.Algorithms.Intro as I nextR :: StdGen -> (Int, StdGen) nextR = randomR (minBound, maxBound) buildArray :: StdGen -> Int -> UArray Int Int buildArray sg size = runSTUArray (do arr <- unsafeNewArray_ (0, size-1) let fill i g | i < size = do let (r, g') = nextR g unsafeWrite arr i r fill (i+1) g' | otherwise = return arr fill 0 sg) buildVector :: StdGen -> Int -> U.Vector Int buildVector sg size = U.fromList $ take size (randoms sg) time :: IO a -> IO () time action = do t0 <- getCPUTime action t1 <- getCPUTime let tm :: Double tm = fromInteger (t1 - t0) * 1e-9 printf "%.3f ms\n" tm stu :: UArray Int Int -> Int -> IO () stu ua sz = do let !sa = runSTUArray (do st <- unsafeThaw ua stuqsort st return st) forM_ [0, sz `quot` 2, sz-1] (print . (sa `unsafeAt`)) intro :: U.Vector Int -> Int -> IO () intro uv sz = do let !sv = runST (do st <- U.unsafeThaw uv I.sort st U.unsafeFreeze st) forM_ [0, sz `quot` 2, sz-1] (print . U.unsafeIndex sv) vquick :: U.Vector Int -> Int -> IO () vquick uv sz = do let !sv = runST (do st <- U.unsafeThaw uv vquicksort st U.unsafeFreeze st) forM_ [0, sz `quot` 2, sz-1] (print . U.unsafeIndex sv) main :: IO () main = do args <- getArgs let !num = case args of (a:_) -> read a _ -> 1000000 !sg <- getStdGen let !ar = buildArray sg num !vc = buildVector sg num !v2 = buildVector sg (foo num) algos = [ ("Intro", intro v2), ("STUArray", stu ar), ("Vquick", vquick vc) ] printf "Created data to be sorted, last elements %d %d %d\n" (ar ! (num-1)) (vc U.! (num-1)) (v2 U.! (num-1)) forM_ algos $ \(name, act) -> do putStrLn name time (act num) -- For the prevention of sharing foo :: Int -> Int foo n | n < 0 = -n | n > 0 = n | otherwise = 3
The results (times only):
$ ./timeSorts 3000000 Intro 587.911 ms STUArray 402.939 ms Vquick 414.936 ms $ ./timeSorts 1000000 Intro 193.970 ms STUArray 131.980 ms Vquick 134.979 ms
The practically identical quicksorts on the STUArray
and the unboxed vector take practically the same time, as expected. (The old quicksort implementation was about 15% slower than the introsort. Comparing to the times above, about 70-75% there was spent converting from/to lists.)
On the random input, the quicksorts perform significantly better than the introsort, but on almost-sorted input, their performance would degrade while introsort wouldn't.
¹ Making the code polymorphic with STUArray
s is a pain at best, doing it with IOUArray
s and having both the sorting and the wrapper {-# INLINABLE #-}
produces the same performance with optimisations - without, the polymorphic code is significantly slower.
² Using the same algorithms, both were always equally fast within the precision of measurement when I compared (not very often).
There's nothing better than a test, right? And the results are not unsurprising: for lists of random integers in range [0 .. 1000000]
,
list size: 200000 ghc -O2 -fllvm -fllvm-O2 ──────── ──────── ──────── ──────── ──────── Data.List.sort 0.878969s 0.883219s 0.878106s 0.888758s Naïve.quicksort 0.711305s 0.870647s 0.845508s 0.919925s UArray_IO.quicksort 9.317783s 1.919583s 9.390687s 1.945072s Vector_Mutable.quicksort 1.48142s 0.823004s 1.526661s 0.806837s
Here, Data.List.sort
is just what it is, Naïve.quicksort
is the algorithm you quoted, UArray_IO.quicksort
and Vector_Mutable.quicksort
are taken from the question you linked to: klapaucius' and Dan Burton's answer which turn out to be very suboptimal performance-wise, see what better Daniel Fischer could do it, both wrapped so as to accept lists (not sure if I got this quite right):
quicksort :: [Int] -> [Int] quicksort l = unsafePerformIO $ do let bounds = (0, length l) arr <- newListArray bounds l :: IO (IOUArray Int Int) uncurry (qsort arr) bounds getElems arr
and
quicksort :: Ord a => [a] -> [a] quicksort = toList . iqsort . fromList
respectively.
As you can see, the naïve algorithm is not far behind the mutable solution with Data.Vector
in terms of speed for sorting a list of random-generated integers, and the IOUArray
is actually much worse. Test was carried out on an Intel i5 laptop running Ubuntu 11.10 x86-64.
Note that this does not mean that a nice list-based program can always keep up with its mutably-implemented equivalents, but GHC sure does a great job at bringing the performance close. Also, it depends of course on the data: these are the times when the random-generated lists to sort contain values in between 0 and 1000 rather than 0 an 1000000 as above, i.e. with many duplicates:
list size: 200000 ghc -O2 -fllvm -fllvm-O2 ──────── ──────── ──────── ──────── ──────── Data.List.sort 0.864176s 0.882574s 0.850807s 0.857957s Naïve.quicksort 1.475362s 1.526076s 1.475557s 1.456759s UArray_IO.quicksort 24.405938s 5.255001s 23.561911s 5.207535s Vector_Mutable.quicksort 3.449168s 1.125788s 3.202925s 1.117741s
Not to speak of pre-sorted arrays.
What's quite interesting, (becomes only apparent with really large sizes, which require rtsopts to increase the stack capacity), is how both mutable implementations become significantly slower with -fllvm -O2
:
list size: 3⋅10⁶ ghc -O1 -fllvm-O1 -O2 -fllvm-O2 ──────── ──────── ──────── ──────── ──────── Data.List.sort 23.897897s 24.138117s 23.708218s 23.631968s Naïve.quicksort 17.068644s 19.547817s 17.640389s 18.113622s UArray_IO.quicksort 35.634132s 38.348955s 37.177606s 49.190503s Vector_Mutable.quicksort 17.286982s 17.251068s 17.361247s 36.840698s
It seems kind of logical to me that the immutable implementations fare better on llvm (doesn't it do everything immutably on some level?), though I don't understand why this only becomes apparent as a slowdown to the mutable versions at high optimisation and large data sizes.
$ cat QSortPerform.hs module Main where import qualified Data.List(sort) import qualified Naïve import qualified UArray_IO import qualified Vector_Mutable import Control.Monad import System.Random import System.Environment sortAlgos :: [ (String, [Int]->[Int]) ] sortAlgos = [ ("Data.List.sort", Data.List.sort) , ("Naïve.quicksort", Naïve.quicksort) , ("UArray_IO.quicksort", UArray_IO.quicksort) , ("Vector_Mutable.quicksort", Vector_Mutable.quicksort) ] main = do args <- getArgs when (length args /= 2) $ error "Need 2 arguments" let simSize = read $ args!!1 randArray <- fmap (take simSize . randomRs(0,1000000)) getStdGen let sorted = case filter ((== args!!0) . fst) sortAlgos of [(_, algo)] -> algo randArray _ -> error $ "Argument must be one of " ++ show (map fst sortAlgos) putStr "First element: "; print $ sorted!!0 putStr "Middle element: "; print $ sorted!!(simSize`div`2) putStr "Last element: "; print $ sorted!!(simSize-1)
which takes the algorithm name and array size on command-line. Runtime comparison was done with this program:
$ cat PerformCompare.hs module Main where import System.Process import System.Exit import System.Environment import Data.Time.Clock import Data.List import Control.Monad import Text.PrettyPrint.Boxes compiler = "ghc" testProgram = "./QSortPerform" flagOpts = [[], ["-O2"], ["-fllvm"], ["-fllvm","-O2"]] algos = ["Data.List.sort","Naïve.quicksort","UArray_IO.quicksort","Vector_Mutable.quicksort"] main = do args <- getArgs let testSize = case args of [numb] -> read numb _ -> 200000 results <- forM flagOpts $ \flags -> do compilerExitC <- verboseSystem compiler $ testProgram : "-fforce-recomp" : flags when (compilerExitC /= ExitSuccess) . error $ "Compiler error \"" ++ show compilerExitC ++"\"" algoCompare <- forM algos $ \algo -> do startTime <- getCurrentTime exitC <- verboseSystem testProgram [algo, show testSize] endTime <- getCurrentTime when (exitC /= ExitSuccess) . error $ "Program error \"" ++ show exitC ++"\"" return . text . show $ diffUTCTime endTime startTime return . vcat right $ text(concat flags) : text("────────") : algoCompare let table = hsep 2 bottom $ vcat left (map text $ ("list size: "++show testSize) : "────────" : algos ) : results printBox table verboseSystem :: String -> [String] -> IO ExitCode verboseSystem cmd args = do putStrLn . unwords $ cmd : args rawSystem cmd args
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With