I am learning parallel Strategies in Haskell.
I wrote an integrator function based on the trapezoidal rule (sequential):
import Control.DeepSeq (NFData)

integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT f (ini, fin) dx
  = let lst = map f [ini,ini+dx..fin]
    in  sum lst * dx - 0.5 * (f ini + f fin) * dx
And I run it as follows:
main = do
  print $ (integrateT (\x -> x^4 - x^3 + x^2 + x/13 + 1) (0.0,1000000.0) 0.01 :: Double)
Here are the statistics from that run:
stack exec lab5 -- +RTS -ls -N2 -s
1.9999974872991426e29
18,400,147,552 bytes allocated in the heap
20,698,168 bytes copied during GC
66,688 bytes maximum residency (2 sample(s))
35,712 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 17754 colls, 17754 par 0.123s 0.105s 0.0000s 0.0011s
Gen 1 2 colls, 1 par 0.000s 0.000s 0.0001s 0.0002s
Parallel GC work balance: 0.27% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 6.054s ( 5.947s elapsed)
GC time 0.123s ( 0.106s elapsed)
EXIT time 0.001s ( 0.008s elapsed)
Total time 6.178s ( 6.061s elapsed)
Alloc rate 3,039,470,269 bytes per MUT second
Productivity 98.0% of total user, 98.2% of total elapsed
gc_alloc_block_sync: 77
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
As you can see, it runs pretty fast. However, when I attempt to make it parallel, for example like this:
import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT f (ini, fin) dx
  = let lst = map f [ini,ini+dx..fin] `using` parListChunk 100 rdeepseq
    in  sum lst * dx - 0.5 * (f ini + f fin) * dx
it always runs much longer. Here are example statistics from such a run:
stack exec lab5 -- +RTS -ls -N2 -s
1.9999974872991426e29
59,103,320,488 bytes allocated in the heap
17,214,458,128 bytes copied during GC
2,787,092,160 bytes maximum residency (15 sample(s))
43,219,264 bytes maximum slop
5570 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 44504 colls, 44504 par 16.907s 10.804s 0.0002s 0.0014s
Gen 1 15 colls, 14 par 4.006s 2.991s 0.1994s 1.2954s
Parallel GC work balance: 33.60% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 1000001 (1000001 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 14.298s ( 12.392s elapsed)
GC time 20.912s ( 13.795s elapsed)
EXIT time 0.000s ( 0.003s elapsed)
Total time 35.211s ( 26.190s elapsed)
Alloc rate 4,133,806,996 bytes per MUT second
Productivity 40.6% of total user, 47.3% of total elapsed
gc_alloc_block_sync: 2304055
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1105370
A few things I see here: compared to the sequential run, the parallel one allocates roughly three times as much, maximum residency explodes from about 66 KB to about 2.7 GB, and more than half of the total time is spent in GC, even though all 1,000,001 sparks are reported as converted. I experimented a bit, but the parallel version always ends up much slower than the sequential one. Why does this happen, and how can I parallelise this integrator efficiently?
A strategy like parListChunk only makes sense when by far most of the computational cost is in evaluating the individual list elements, compared to the overhead of building the list spine etc. in the first place. With integrateT over a simple polynomial, however, the elements are very cheap to compute and most of the cost goes into the list overhead. The only reason the sequential version still runs efficiently is that GHC can inline/fuse most of that business away, but apparently not in the parallelised version.
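For contrast, here is a hedged sketch (not from the question; costly is a made-up expensive per-element function) of the kind of situation where parListChunk does pay off, because the work per element dwarfs the spine and spark bookkeeping:

import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Hypothetical example: each element is expensive to evaluate, so the
-- overhead of building the list and sparking chunks of 100 is negligible
-- relative to the useful work done per element.
expensiveMap :: [Double] -> [Double]
expensiveMap xs = map costly xs `using` parListChunk 100 rdeepseq
  where
    costly x = sum [ sin (x * fromIntegral k) | k <- [1 .. 10000 :: Int] ]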
Solution: let each thread run a properly fusable sequential version, i.e. divide up the interval itself rather than its discretised list form. Like this:
import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parList, rdeepseq, using)

integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT f (ini, fin) dx
  = let lst = map f [ini,ini+dx..fin]
    in  sum lst * dx - 0.5 * (f ini + f fin) * dx

integrateT_par :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT_par f (l,r) dx
  = let chunks = [ integrateT f (l + i*wChunk, l + (i+1)*wChunk) dx
                 | i <- [0..nChunks-1] ]
                 `using` parList rdeepseq
    in  sum chunks
  where nChunks = 100
        wChunk  = (r-l)/nChunks
This should then not have significantly worse memory behaviour or performance than the sequential version.
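For reference, a driver analogous to the question's main (same polynomial and interval, with the result again annotated as Double) could look like this:

main :: IO ()
main = print (integrateT_par (\x -> x^4 - x^3 + x^2 + x/13 + 1) (0.0, 1000000.0) 0.01 :: Double)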
Mind, as I test it now it doesn't actually seem to run in parallel at all, but I'm pretty sure I just missed something in the setup...
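If the parList version doesn't kick in for you either, the same chunked decomposition can also be written with parMap; this is only a sketch of the same idea under the same assumptions, not a separately tested variant:

import Control.Parallel.Strategies (parMap, rdeepseq)

-- Each chunk integral is computed by the sequential, fusable integrateT;
-- parMap rdeepseq then evaluates the chunk results in parallel.
integrateT_par' :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT_par' f (l,r) dx = sum (parMap rdeepseq chunk [0 .. nChunks-1])
  where
    chunk i = integrateT f (l + i*wChunk, l + (i+1)*wChunk) dx
    nChunks = 100
    wChunk  = (r-l)/nChunks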