I'm trying to learn how to use the Control.Parallel
module, but I think I didn't get it right.
I'm trying to run the following code (fibs.hs).
import Control.Parallel
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = p `par` (q `pseq` (p + q))
where
p = fib (n-1)
q = fib (n-2)
main = print $ fib 30
I compiled this with:
ghc -O2 --make -threaded fibs.hs
And then I get the following results executing this program (output of a Python script that runs each program 100 times and returns the average and standard deviation of the execution time):
./fibs +RTS -N1 -> avg= 0.060203 s, deviation = 0.004112 s
./fibs +RTS -N2 -> avg= 0.052335 s, deviation = 0.006713 s
./fibs +RTS -N3 -> avg= 0.052935 s, deviation = 0.006183 s
./fibs +RTS -N4 -> avg= 0.053976 s, deviation = 0.007106 s
./fibs +RTS -N5 -> avg= 0.055227 s, deviation = 0.008598 s
./fibs +RTS -N6 -> avg= 0.055703 s, deviation = 0.006537 s
./fibs +RTS -N7 -> avg= 0.058327 s, deviation = 0.007526 s
My questions are:
What exactly is happening when I evaluate:
a `par` (b `pseq` (a + b)) ?
I understand that a par
b is supposed to hint the compiler about calculating a in parallel with b and return b. OK. But what does pseq
do?
Why do I see such a small performance increase? I'm running this in an Intel Core 2 Quad machine. I'd expect that running with -N5 or -N6 wouldn't make a real difference in performance or that the program would actually start to perform very badly. But why do I see no improvement from -N2 to -N3 and why is the initial improvement so small?
As Don explained, the problem is that you are creating too many sparks. Here's how you might rewrite it to get a good speedup.
import Control.Parallel
cutoff :: Int
cutoff = 20
parFib :: Int -> Int
parFib n | n < cutoff = fib n
parFib n = p `par` q `pseq` (p + q)
where
p = parFib $ n - 1
q = parFib $ n - 2
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)
main :: IO ()
main = print $ parFib 40
demonstration:
[computer ~]$ ghc --make -threaded -O2 Main.hs
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking Main ...
[computer ~]$ time ./Main +RTS -N1
102334155
real 0m1.509s
user 0m1.450s
sys 0m0.003s
[computer ~]$ time ./Main +RTS -N2
102334155
real 0m0.776s
user 0m1.487s
sys 0m0.023s
[computer ~]$ time ./Main +RTS -N3
102334155
real 0m0.564s
user 0m1.487s
sys 0m0.030s
[computer ~]$ time ./Main +RTS -N4
102334155
real 0m0.510s
user 0m1.587s
sys 0m0.047s
[computer ~]$
You're creating an exponential number of sparks (think of how many recursive calls you're creating here). To actually get good parallelism you need to create less parallel work in this case, since your hardware can't handle that many threads (and so GHC doesn't make them).
The solution is to use a cutoff strategy, as described in this talk: http://donsbot.wordpress.com/2009/09/05/defun-2009-multicore-programming-in-haskell-now/
Basically, switch over to the straight line version once you reach a certain depth, and use +RTS -sstderr to see how many sparks are being converted, so you can determine if you're wasting work or not.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With