I was playing with the parallel Haskell functions par and pseq and I discovered something interesting. My examples are based on the examples from the Real World Haskell book (chapter "Parallel programming in Haskell").

Common code:
import Control.Parallel (par, pseq)

-- <<sorting code goes here>>

force :: [a] -> ()
force xs = go xs `pseq` ()
    where go (_:xs) = go xs
          go []     = 1

main = do
    print $ take 10 $ parSort [0..1000000]
Sorting code 1 (taken from the book):

parSort :: (Ord a) => [a] -> [a]
parSort (x:xs) = force greater `par` (force lesser `pseq`
                                      (lesser ++ x:greater))
    where lesser  = parSort [y | y <- xs, y < x]
          greater = parSort [y | y <- xs, y >= x]
parSort _ = []
Sorting code 2 (my custom variant):

parSort :: (Ord a) => [a] -> [a]
parSort (x:xs) = force greater `par` (lesser ++ x:greater)
    where lesser  = parSort [y | y <- xs, y < x]
          greater = parSort [y | y <- xs, y >= x]
parSort _ = []
Compile & run with:

ghc -O2 -threaded --make Main.hs && time ./Main +RTS -N8
What is interesting is that my variant is a little faster than the book's one:

sorting code 1: avg. 16 seconds
sorting code 2: avg. 14 seconds
I want to ask why we can observe this behavior and whether the book's solution gives any benefits over mine. I would love to understand in depth why that solution could perform better.
I'd say it's because your custom variant doesn't force the left part of the list. Let's have a look at what happens at the top level: you force the right half of the list, but not the left half. So when you print the first 10 elements, you only lazily evaluate the first 10 elements of the left part, and the rest of it is left unevaluated.
On the other hand, the solution from the book forces both parts, so before you print the first 10 elements, you evaluate both the left and right part.
Instead of printing the first 10 elements, try printing the last one, like

print $ last $ parSort input

(note that data is a reserved word in Haskell, so the input list needs a different name). Then both variants of the algorithm will have to evaluate the whole list. Or force the whole list after sorting it and before printing it.
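One way to do that is to apply the question's own force to the whole result before printing. A minimal sketch (importing par and pseq from base's GHC.Conc instead of Control.Parallel, so the snippet needs no extra packages; the parallel package's Control.Parallel re-exports the same functions):

```haskell
-- par and pseq from base; the question imports the same functions
-- from Control.Parallel (the parallel package).
import GHC.Conc (par, pseq)

-- The question's force: walk the spine so the whole list is evaluated.
force :: [a] -> ()
force xs = go xs `pseq` ()
    where go (_:ys) = go ys
          go []     = (1 :: Int)

-- The book's variant, which forces both halves.
parSort :: Ord a => [a] -> [a]
parSort (x:xs) = force greater `par` (force lesser `pseq`
                                      (lesser ++ x:greater))
    where lesser  = parSort [y | y <- xs, y < x]
          greater = parSort [y | y <- xs, y >= x]
parSort _ = []

-- Force the entire result before printing, so laziness can't hide any
-- of the sorting work and timings of both variants become comparable.
demo :: IO ()
demo = do
    let sorted = parSort [5, 3, 8, 1, 4 :: Int]
    force sorted `pseq` print (take 3 sorted)
```

With the full result forced up front, printing only a prefix afterwards no longer lets the left half stay unevaluated.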
Note that sorting [0..1000000] with this algorithm will be very inefficient: the input is already sorted, so you always choose the worst possible pivot and the sort takes O(n^2) time. The measurements won't give meaningful results at all. If you want results that reflect the O(n log n) average case, feed the algorithm random-like data. You can find a simple method for creating a random permutation here.
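As a quick stand-in (this is an illustrative sketch, not the linked method), even a hand-rolled linear congruential generator produces input scrambled enough that the pivots are no longer pathological:

```haskell
-- A minimal linear congruential generator; the multiplier and increment
-- are the classic Numerical Recipes constants (an illustrative choice).
lcg :: Int -> Int
lcg s = (1664525 * s + 1013904223) `mod` (2 ^ (32 :: Int))

-- n pseudo-random numbers generated from a seed.
randomish :: Int -> Int -> [Int]
randomish seed n = take n (tail (iterate lcg seed))
```

Then benchmark e.g. parSort (randomish 42 1000000) instead of the already-sorted [0..1000000].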
Note: instead of using time, I'd suggest using criterion to measure your code. You can then measure only the relevant parts of your code, excluding initialization etc., and force your input and output data so that you measure precisely the part you're interested in.
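A minimal benchmark sketch, assuming the criterion package is installed (defaultMain, bench and nf are criterion's API; Data.List.sort stands in here for parSort, so swap in your own function):

```haskell
import Criterion.Main (defaultMain, bench, nf)
import Data.List (sort)  -- stand-in; benchmark parSort the same way

main :: IO ()
main = defaultMain
    [ -- nf evaluates the result to normal form, so laziness cannot hide
      -- any of the sorting work from the measurement
      bench "sort/descending 100k" $ nf sort [100000, 99999 .. 1 :: Int]
    ]
```

Run it the same way as before (ghc -O2 -threaded Main.hs && ./Main +RTS -N8); criterion runs the benchmark many times and reports a mean with confidence intervals, which is far more reliable than a single time measurement.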