I'm currently trying to understand how to program in parallel in Haskell. I'm following the paper "A Tutorial on Parallel and Concurrent Programming in Haskell" by Simon Peyton Jones and Satnam Singh. The source code are as followed:
module Main where
import Control.Parallel
import System.Time
main :: IO ()
main = do
putStrLn "Starting computation....."
t0 <- getClockTime
pseq r1 (return())
t1 <- getClockTime
putStrLn ("sum: " ++ show r1)
putStrLn ("time: " ++ show (secDiff t0 t1) ++ " seconds")
putStrLn "Finish."
r1 :: Int
r1 = parSumFibEuler 38 5300
-- This is the Fibonacci number generator
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
-- Gets the euler sum
mkList :: Int -> [Int]
mkList n = [1..n-1]
relprime :: Int -> Int -> Bool
relprime x y = gcd x y == 1
euler :: Int -> Int
euler n = length $ filter (relprime n) (mkList n)
sumEuler :: Int -> Int
sumEuler = sum.(map euler).mkList
-- Gets the sum of Euler and Fibonacci (NORMAL)
sumFibEuler :: Int -> Int -> Int
sumFibEuler a b = fib a + sumEuler b
-- Gets the sum of Euler and Fibonacci (PARALLEL)
parSumFibEuler :: Int -> Int -> Int
parSumFibEuler a b =
f `par` (e `pseq`(f+e))
where
f = fib a
e = sumEuler b
-- Measure time
secDiff :: ClockTime -> ClockTime -> Float
secDiff (TOD secs1 psecs1) (TOD secs2 psecs2)
= fromInteger (psecs2 -psecs1) / 1e12 + fromInteger (secs2- secs1)
I compiled it with the following command:
ghc --make -threaded Main.hs
a) Ran it using 1 core:
./Main +RTS -N1
b) Ran it using 2 core:
./Main +RTS -N2
However, the one core ran 53.556sec. Whereas, the two core ran 73.401sec. I don't understand how 2 cores can actually run slower then 1 core. Maybe the message passing overhead is too big for this small program? The paper have completely different outcomes compared to mines. Following are the output details.
For 1 core:
Starting computation.....
sum: 47625790
time: 53.556335 seconds
Finish.
17,961,210,216 bytes allocated in the heap
12,595,880 bytes copied during GC
176,536 bytes maximum residency (3 sample(s))
23,904 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 34389 colls, 0 par 2.54s 2.57s 0.0001s 0.0123s
Gen 1 3 colls, 0 par 0.00s 0.00s 0.0007s 0.0010s
Parallel GC work balance: -nan (0 / 0, ideal 1)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s)
Task 1 (worker) : 0.00s ( 53.56s) 0.00s ( 0.00s)
Task 2 (bound) : 50.49s ( 50.99s) 2.52s ( 2.57s)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 50.47s ( 50.99s elapsed)
GC time 2.54s ( 2.57s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 53.02s ( 53.56s elapsed)
Alloc rate 355,810,305 bytes per MUT second
Productivity 95.2% of total user, 94.2% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
For 2 core:
Starting computation.....
sum: 47625790
time: 73.401146 seconds
Finish.
17,961,210,256 bytes allocated in the heap
12,558,088 bytes copied during GC
176,536 bytes maximum residency (3 sample(s))
195,936 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 34389 colls, 34388 par 7.42s 4.73s 0.0001s 0.0205s
Gen 1 3 colls, 3 par 0.01s 0.00s 0.0011s 0.0017s
Parallel GC work balance: 1.00 (1432193 / 1429197, ideal 2)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 1.19s ( 40.26s) 16.95s ( 33.15s)
Task 1 (worker) : 0.00s ( 73.40s) 0.00s ( 0.00s)
Task 2 (bound) : 54.50s ( 68.67s) 3.66s ( 4.73s)
Task 3 (worker) : 0.00s ( 73.41s) 0.00s ( 0.00s)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 68.87s ( 68.67s elapsed)
GC time 7.43s ( 4.73s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 76.31s ( 73.41s elapsed)
Alloc rate 260,751,318 bytes per MUT second
Productivity 90.3% of total user, 93.8% of total elapsed
gc_alloc_block_sync: 12254
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
r1 = sumFibEuler 38 5300
I believe that you meant
r1 = parSumFibEuler 38 5300
On my configuration (with parSumFibEuler 45 8000
and with only one run):
I suspect fib
function to be much more CPU consuming than sumEuler
. That'd explain the low improvement of -N2. There won't be some work-stealing in your situation.
With memoization, your fibonacci function would be much better but I don't think that's what you wanted to try.
EDIT: as mentioned in the comments, I think that with -N2 you have a lot of interruptions since you have two cores available.
Example on my configuration (4 cores) with sum $ parMap rdeepseq (fib) [1..40]
From here:
Be careful when using all the processors in your machine: if some of your processors are in use by other programs, this can actually harm performance rather than improve it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With