
Is it possible to speed up a quicksort with par in Haskell?

I have this seemingly trivial parallel quicksort implementation; the code is as follows:

import System.Random
import Control.Parallel
import Data.List

quicksort :: Ord a => [a] -> [a]
quicksort xs = pQuicksort 16 xs -- 16 is the number of sparks used to sort

-- pQuicksort, parallelQuicksort  
-- As long as n > 0, evaluate the lower and upper parts of the list in parallel;
-- once we have recursed deep enough (n == 0) this turns into a serial quicksort.
pQuicksort :: Ord a => Int -> [a] -> [a]
pQuicksort _ [] = []
pQuicksort 0 (x:xs) =
  let (lower, upper) = partition (< x) xs
  in pQuicksort 0 lower ++ [x] ++ pQuicksort 0 upper
pQuicksort n (x:xs) =
  let (lower, upper) = partition (< x) xs
      l = pQuicksort (n `div` 2) lower
      u = [x] ++ pQuicksort (n `div` 2) upper
  in (par u l) ++ u

main :: IO ()
main = do
  gen <- getStdGen
  let randints = (take 5000000) $ randoms gen :: [Int]
  putStrLn . show . sum $ (quicksort randints)

I compile with

ghc --make -threaded -O2 quicksort.hs

and run with

./quicksort +RTS -N16 -RTS

No matter what I do, I cannot get this to run faster than a simple sequential implementation running on one CPU.

  1. Is it possible to explain why this runs so much slower on several CPUs than on one?
  2. Is it possible to make this scale, at least sub-linearly, with the number of CPUs by doing some trick?

EDIT: @tempestadept hinted that quicksort itself is the problem. To check this I implemented a simple merge sort in the same spirit as the example above. It shows the same behaviour: it performs worse the more capabilities you add.

import System.Random
import Control.Parallel

splitList :: [a] -> ([a], [a])
splitList = helper True [] []
  where helper _ left right [] = (left, right)
        helper True  left right (x:xs) = helper False (x:left) right xs
        helper False left right (x:xs) = helper True  left (x:right) xs

merge :: (Ord a) => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) = case x<y of
  True  -> x : merge xs (y:ys)
  False -> y : merge (x:xs) ys

mergeSort :: (Ord a) => [a] -> [a]
mergeSort xs = pMergeSort 16 xs -- we use 16 sparks

-- pMergeSort, parallel merge sort. Takes an extra argument
-- telling how many sparks to create. In our simple test it is
-- set to 16
pMergeSort :: (Ord a) => Int -> [a] -> [a]
pMergeSort _ [] = []
pMergeSort _ [a] = [a]
pMergeSort 0 xs =
  let (left, right) = splitList xs
  in  merge (pMergeSort 0 left) (pMergeSort 0 right)
pMergeSort n xs =
  let (left, right) = splitList xs
      l = pMergeSort (n `div` 2) left
      r = pMergeSort (n `div` 2) right
  in  (r `par` l) `pseq` (merge l r)

ris :: Int -> IO [Int]
ris n = do
  gen <- getStdGen
  return . (take n) $ randoms gen

main = do
  r <- ris 100000
  putStrLn . show . sum $ mergeSort r
asked Nov 03 '13 by lysgaard

1 Answer

There are a couple of problems that have already been mentioned:

  • Using lists is not going to give the performance you are looking for. Even this sample implementation using vector is roughly 50x faster than using lists, since it does in-place element swaps. For this reason my answer will use the array library massiv rather than lists.
  • I tend to find the Haskell scheduler far from perfect for CPU-bound tasks, so, as @Edward Kmett noted in his answer, we need a work-stealing scheduler, which I conveniently implemented for the above-mentioned library: scheduler
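
The code below assumes roughly the following imports (a sketch on my part; the exact module layout varies between massiv versions, and the complete, buildable project is linked at the end of this answer):

{-# LANGUAGE BangPatterns        #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Monad (when)
import Control.Monad.Primitive (PrimMonad, PrimState)
import Data.Massiv.Array as A
import Data.Massiv.Array.Unsafe as A
import System.IO.Unsafe (unsafePerformIO)

First, a helper that partitions a region of a mutable array in place:
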
-- A helper function that partitions a region of a mutable array.
unstablePartitionRegionM ::
     forall r e m. (Mutable r Ix1 e, PrimMonad m)
  => MArray (PrimState m) r Ix1 e
  -> (e -> Bool)
  -> Ix1 -- ^ Start index of the region
  -> Ix1 -- ^ End index of the region
  -> m Ix1
unstablePartitionRegionM marr f start end = fromLeft start (end + 1)
  where
    fromLeft i j
      | i == j = pure i
      | otherwise = do
        x <- A.unsafeRead marr i
        if f x
          then fromLeft (i + 1) j
          else fromRight i (j - 1)
    fromRight i j
      | i == j = pure i
      | otherwise = do
        x <- A.unsafeRead marr j
        if f x
          then do
            A.unsafeWrite marr j =<< A.unsafeRead marr i
            A.unsafeWrite marr i x
            fromLeft (i + 1) j
          else fromRight i (j - 1)
{-# INLINE unstablePartitionRegionM #-}

Here is the actual in-place quicksort:

quicksortMArray ::
     (Ord e, Mutable r Ix1 e, PrimMonad m)
  => Int
  -> (m () -> m ())
  -> A.MArray (PrimState m) r Ix1 e
  -> m ()
quicksortMArray numWorkers schedule marr =
  schedule $ qsort numWorkers 0 (unSz (msize marr) - 1)
  where
    qsort n !lo !hi =
      when (lo < hi) $ do
        -- Use the last element of the region as the pivot, partition the
        -- region around it, then swap the pivot into its final position.
        p <- A.unsafeRead marr hi
        l <- unstablePartitionRegionM marr (< p) lo hi
        A.unsafeWrite marr hi =<< A.unsafeRead marr l
        A.unsafeWrite marr l p
        if n > 0
          then do
            -- While the depth budget (initially numWorkers) is not exhausted,
            -- hand both halves to the scheduler so they can be sorted concurrently.
            let !n' = n - 1
            schedule $ qsort n' lo (l - 1)
            schedule $ qsort n' (l + 1) hi
          else do
            -- Deep enough in the recursion: finish sequentially to avoid
            -- scheduling overhead on small regions.
            qsort n lo (l - 1)
            qsort n (l + 1) hi
{-# INLINE quicksortMArray #-}

Now, if we look at the arguments numWorkers and schedule, they are pretty opaque. If we supply 1 for the first argument and id for the second, we simply get a sequential quicksort; but if we have a function available that can schedule each task to be computed concurrently, we get a parallel implementation of quicksort. Luckily for us, massiv provides exactly that out of the box with withMArray:

withMArray ::
     (Mutable r ix e, MonadUnliftIO m)
  => Array r ix e
  -> (Int -> (m () -> m ()) -> MArray RealWorld r ix e -> m a)
  -> m (Array r ix e)
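
As an aside, the sequential specialization described above is literally just the following (quicksortSeqM is a name I introduce here for illustration; it is not part of the original code):

quicksortSeqM :: (Ord e, Mutable r Ix1 e, PrimMonad m)
              => A.MArray (PrimState m) r Ix1 e -> m ()
quicksortSeqM = quicksortMArray 1 id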

Here is a pure version that makes a copy of the array and then sorts it in place using the computation strategy specified within the array itself:

quicksortArray :: (Mutable r Ix1 e, Ord e) => Array r Ix1 e -> Array r Ix1 e
quicksortArray arr = unsafePerformIO $ withMArray arr quicksortMArray
{-# INLINE quicksortArray #-}
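
For completeness, here is a minimal usage sketch of my own (not part of the original answer), assuming the definitions above are in scope and System.Random is imported as in the question. The computation strategy stored in the array decides how the sort runs: Seq keeps it single-threaded, Par uses all available capabilities:

main :: IO ()
main = do
  gen <- getStdGen
  let xs = take 1000000 (randoms gen) :: [Int]
      -- Par makes quicksortArray sort in parallel on all capabilities;
      -- replacing it with Seq gives the sequential run.
      arr = A.fromList Par xs :: Array P Ix1 Int
  print (A.sum (quicksortArray arr))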

Here comes the best part: the benchmarks. The order of results:

  • Intro sort from vector-algorithms
  • In-place quicksort using vector from this answer
  • Implementation in C, which I grabbed from this question
  • Sequential quicksort using massiv
  • Same as above, but in parallel, on a computer with a humble 3rd-gen i7 quad-core processor with hyperthreading

benchmarking QuickSort/Vector Algorithms
time                 101.3 ms   (93.75 ms .. 107.8 ms)
                     0.991 R²   (0.974 R² .. 1.000 R²)
mean                 97.13 ms   (95.17 ms .. 100.2 ms)
std dev              4.127 ms   (2.465 ms .. 5.663 ms)

benchmarking QuickSort/Vector  
time                 89.51 ms   (87.69 ms .. 91.92 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 92.67 ms   (91.54 ms .. 94.50 ms)
std dev              2.438 ms   (1.468 ms .. 3.493 ms)

benchmarking QuickSort/C       
time                 88.14 ms   (86.71 ms .. 89.41 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 90.11 ms   (89.17 ms .. 93.35 ms)
std dev              2.744 ms   (387.1 μs .. 4.686 ms)

benchmarking QuickSort/Array   
time                 76.07 ms   (75.77 ms .. 76.41 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 76.08 ms   (75.75 ms .. 76.28 ms)
std dev              453.7 μs   (247.8 μs .. 699.6 μs)

benchmarking QuickSort/Array Par
time                 25.25 ms   (24.84 ms .. 25.61 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 25.13 ms   (24.80 ms .. 25.75 ms)
std dev              991.6 μs   (468.5 μs .. 1.782 ms)

The benchmarks sort 1,000,000 random Int64s. If you'd like to see the full code, you can find it here: https://github.com/lehins/haskell-quicksort

To sum it up, we got roughly a 3x speedup on a quad-core processor with 8 capabilities, which sounds pretty good to me. Thanks for this question; now I can add a sorting function to massiv ;)

Edit

Note that the accepted answer, which uses lists instead of a more appropriate data structure for this problem such as a mutable array, is about 100x slower on the same input:

benchmarking List/random/List Par
time                 2.712 s    (2.566 s .. 3.050 s)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 2.696 s    (2.638 s .. 2.745 s)
std dev              59.09 ms   (40.83 ms .. 72.04 ms)
variance introduced by outliers: 19% (moderately inflated)
answered Oct 07 '22 by lehins