I was trying to experiment with parallel evaluation in Haskell, but I seem to have hit a wall. I wanted to evaluate a list of tasks that each take a long time to complete, so I came up with this contrived example:
import Control.Parallel.Strategies

startNum = 800000

bigList :: [Integer]
bigList = [2042^x | x <- [startNum..startNum+10]]

main = print $ sum $ parMap rdeepseq (length . show) bigList
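(Each element of bigList is 2042^x for x near 800000, i.e. an integer with roughly x * log10(2042), or about 2.65 million, decimal digits, so each length . show task is genuinely expensive; 11 such tasks at roughly 2.65 million digits each also accounts for the printed sum of 29128678.)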
I compiled this with ghc -O2 -eventlog -rtsopts -threaded test.hs --make and then ran it twice.
$ time ./test +RTS -N1 -lf -sstderr
29128678
2,702,130,280 bytes allocated in the heap
59,409,320 bytes copied during GC
3,114,392 bytes maximum residency (68 sample(s))
1,093,600 bytes maximum slop
28 MB total memory in use (6 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 3101 colls, 0 par 0.09s 0.08s 0.0000s 0.0005s
Gen 1 68 colls, 0 par 0.03s 0.03s 0.0004s 0.0009s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 11 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 11 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 10.13s ( 10.13s elapsed)
GC time 0.11s ( 0.11s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 10.25s ( 10.25s elapsed)
Alloc rate 266,683,731 bytes per MUT second
Productivity 98.9% of total user, 98.9% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 0m10.250s
user 0m10.144s
sys 0m0.106s
$ time ./test +RTS -N4 -lf -sstderr
29128678
2,702,811,640 bytes allocated in the heap
712,017,768 bytes copied during GC
22,024,144 bytes maximum residency (67 sample(s))
6,134,968 bytes maximum slop
68 MB total memory in use (3 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1329 colls, 1329 par 2.77s 0.70s 0.0005s 0.0075s
Gen 1 67 colls, 66 par 0.11s 0.03s 0.0004s 0.0019s
Parallel GC work balance: 40.17% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 11 (11 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 51.56s ( 13.04s elapsed)
GC time 2.89s ( 0.73s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 54.45s ( 13.77s elapsed)
Alloc rate 52,423,243 bytes per MUT second
Productivity 94.7% of total user, 374.4% of total elapsed
gc_alloc_block_sync: 39520
whitehole_spin: 0
gen[0].sync: 3046
gen[1].sync: 4970
real 0m13.777s
user 0m44.362s
sys 0m10.093s
I notice a slight increase in GC time, but nothing that I would have thought the extra cores couldn't overcome.
So I got out ThreadScope to have a look.

This is the result for -N1:

[ThreadScope timeline for the -N1 run]

And this is the result for -N4:

[ThreadScope timeline for the -N4 run]
It seems like the sparks execute much more rapidly in the -N1 case. My question: why am I not seeing the speedup I would expect from a bunch of independent tasks executing in parallel?
This seems to come down to the Integer operations: if you replace them with something else, you do see a speedup from parallel processing.
This code shows no speedup with -N2:

import Control.Parallel

main =
  let x = length . show $ 10^10000000
      y = length . show $ 10^10000001 in
  x `par` y `pseq` print (x + y)
Nor does this:

import Control.Parallel

main =
  let x = (10^10000000 :: Integer) `quotRem` 123
      y = (10^10000001 :: Integer) `quotRem` 123 in
  x `par` y `pseq` print "ok"
But this code does show a parallel speedup:

import Control.Parallel

main =
  let x = length $ replicate 1000000000 'x'
      y = length $ replicate 1000000001 'y' in
  x `par` y `pseq` print (x + y)
I couldn't find any locking in integer-gmp, though.
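To sanity-check this against the original program's shape, here is a minimal sketch (mine, not from the original post, with a hypothetical busyWork function) that keeps the question's parMap rdeepseq structure but swaps the large-Integer workload for one that only touches machine Ints. Compiled with the same flags as above and run with +RTS -N4, I would expect this variant to scale with the number of cores, in line with the replicate example:

import Control.Parallel.Strategies

-- Hypothetical CPU-bound task that works only on machine Ints,
-- avoiding large-Integer arithmetic entirely.
busyWork :: Int -> Int
busyWork n = sum [(i * i) `mod` 7 | i <- [1 .. n]]

-- Same parMap rdeepseq shape as the program in the question.
main :: IO ()
main = print $ sum $ parMap rdeepseq busyWork (replicate 10 20000000)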