I was playing with the parallel Haskell functions par and pseq and I discovered something interesting. My examples are based on the examples from the Real World Haskell book (chapter "Parallel programming in Haskell").

Common code:
import Control.Parallel (par, pseq)

-- <<sorting code goes here>>

force :: [a] -> ()
force xs = go xs `pseq` ()
    where go (_:xs) = go xs
          go []     = 1

main = do
    print $ take 10 $ parSort [0..1000000]
Sorting code 1 (taken from the book):

parSort :: (Ord a) => [a] -> [a]
parSort (x:xs) = force greater `par` (force lesser `pseq`
                                      (lesser ++ x:greater))
    where lesser  = parSort [y | y <- xs, y < x]
          greater = parSort [y | y <- xs, y >= x]
parSort _ = []
Sorting code 2 (my custom variant):

parSort :: (Ord a) => [a] -> [a]
parSort (x:xs) = force greater `par` (lesser ++ x:greater)
    where lesser  = parSort [y | y <- xs, y < x]
          greater = parSort [y | y <- xs, y >= x]
parSort _ = []
Compile & run with:

ghc -O2 -threaded --make Main.hs && time ./Main +RTS -N8
What is interesting is that my variant is a little faster than the book's one:

sorting code 1: avg. 16 seconds
sorting code 2: avg. 14 seconds
I want to ask why we can observe this behavior and whether the book's solution gives any benefits over mine. I would love to understand in depth why that solution could perform better.
I'd say it's because your custom variant doesn't force the left part of the list. Let's have a look at what happens at the top level: you force the right half of the list, but not the left half. So when you print the first 10 elements, you only lazily evaluate the first 10 elements of the left part, and the rest of it is left unevaluated.
On the other hand, the solution from the book forces both parts, so before you print the first 10 elements, you evaluate both the left and right part.
Instead of printing the first 10 elements, try printing the last one, like

print $ last $ parSort input

(note that data is a reserved word in Haskell, so the input list needs a different name). Then both variants of the algorithm will have to evaluate the whole list. Or force the whole list after sorting it and before printing it.
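One way to do that is to apply the question's own force to the whole result before printing. A minimal sketch (importing par and pseq from base's GHC.Conc instead of Control.Parallel, so the snippet needs no extra packages; the parallel package's Control.Parallel re-exports the same functions):

```haskell
-- par and pseq from base; the question imports the same functions
-- from Control.Parallel (the parallel package).
import GHC.Conc (par, pseq)

-- The question's force: walk the spine so the whole list is evaluated.
force :: [a] -> ()
force xs = go xs `pseq` ()
    where go (_:ys) = go ys
          go []     = (1 :: Int)

-- The book's variant, which forces both halves.
parSort :: Ord a => [a] -> [a]
parSort (x:xs) = force greater `par` (force lesser `pseq`
                                      (lesser ++ x:greater))
    where lesser  = parSort [y | y <- xs, y < x]
          greater = parSort [y | y <- xs, y >= x]
parSort _ = []

-- Force the entire result before printing, so laziness can't hide any
-- of the sorting work and timings of both variants become comparable.
demo :: IO ()
demo = do
    let sorted = parSort [5, 3, 8, 1, 4 :: Int]
    force sorted `pseq` print (take 3 sorted)
```

With the full result forced up front, printing only a prefix afterwards no longer lets the left half stay unevaluated.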
Note that sorting [0..1000000] with this algorithm will be very inefficient: the input is already sorted, so you always choose the worst possible pivot and the sort takes O(n^2) time. The measurements won't give meaningful results at all. If you want results that reflect the O(n log n) average case, feed the algorithm random-like data. You can find a simple method for creating a random permutation here.
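As a quick stand-in (this is an illustrative sketch, not the linked method), even a hand-rolled linear congruential generator produces input scrambled enough that the pivots are no longer pathological:

```haskell
-- A minimal linear congruential generator; the multiplier and increment
-- are the classic Numerical Recipes constants (an illustrative choice).
lcg :: Int -> Int
lcg s = (1664525 * s + 1013904223) `mod` (2 ^ (32 :: Int))

-- n pseudo-random numbers generated from a seed.
randomish :: Int -> Int -> [Int]
randomish seed n = take n (tail (iterate lcg seed))
```

Then benchmark e.g. parSort (randomish 42 1000000) instead of the already-sorted [0..1000000].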
Note: instead of using time, I'd suggest using criterion to measure your code. You can then measure only the relevant parts of your code, excluding initialization etc., and force your input and output data so that you measure precisely the part you're interested in.
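A minimal benchmark sketch, assuming the criterion package is installed (defaultMain, bench and nf are criterion's API; Data.List.sort stands in here for parSort, so swap in your own function):

```haskell
import Criterion.Main (defaultMain, bench, nf)
import Data.List (sort)  -- stand-in; benchmark parSort the same way

main :: IO ()
main = defaultMain
    [ -- nf evaluates the result to normal form, so laziness cannot hide
      -- any of the sorting work from the measurement
      bench "sort/descending 100k" $ nf sort [100000, 99999 .. 1 :: Int]
    ]
```

Run it the same way as before (ghc -O2 -threaded Main.hs && ./Main +RTS -N8); criterion runs the benchmark many times and reports a mean with confidence intervals, which is far more reliable than a single time measurement.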