Parallel numeric integrator function slower than sequential version. Why?

Tags:

haskell

I am learning parallel Strategies in Haskell.

Problem

I wrote an integrator function based on the trapezoidal rule (sequential):

integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a 
integrateT f (ini, fin) dx 
  = let lst = map f [ini,ini+dx..fin]
    in sum lst * dx - 0.5 * (f ini + f fin) * dx
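
This is the composite trapezoidal rule over the grid points x_i = ini + i*dx:

\int_{ini}^{fin} f(x)\,dx \;\approx\; dx \sum_i f(x_i) \;-\; \frac{dx}{2}\bigl(f(ini) + f(fin)\bigr)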

And I run it as follows:

main = do
  print $ (integrateT (\x -> x^4 - x^3 + x^2 + x/13 + 1) (0.0,1000000.0) 0.01 :: Double)

Here are the statistics from a run:

stack exec lab5 -- +RTS -ls -N2 -s
1.9999974872991426e29
  18,400,147,552 bytes allocated in the heap
      20,698,168 bytes copied during GC
          66,688 bytes maximum residency (2 sample(s))
          35,712 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause
  Gen  0     17754 colls, 17754 par    0.123s   0.105s     0.0000s    0.0011s
  Gen  1         2 colls,     1 par    0.000s   0.000s     0.0001s    0.0002s

  Parallel GC work balance: 0.27% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    6.054s  (  5.947s elapsed)
  GC      time    0.123s  (  0.106s elapsed)
  EXIT    time    0.001s  (  0.008s elapsed)
  Total   time    6.178s  (  6.061s elapsed)

  Alloc rate    3,039,470,269 bytes per MUT second

  Productivity  98.0% of total user, 98.2% of total elapsed

gc_alloc_block_sync: 77
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

As you can see, it runs pretty fast. However, when I attempt to make it parallel, for example like this:

integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a 
integrateT f (ini, fin) dx 
  = let lst = (map f [ini,ini+dx..fin]) `using` parListChunk 100 rdeepseq
    in sum lst * dx - 0.5 * (f ini + f fin) * dx
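
For reference, running either version requires the usual Strategies setup - the imports below, plus a binary built with -threaded and -rtsopts (required for the +RTS options) and -eventlog (required for -ls):

import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (using, parListChunk, rdeepseq)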

It always runs much longer. Here are example statistics from the run above:

stack exec lab5 -- +RTS -ls -N2 -s
1.9999974872991426e29
  59,103,320,488 bytes allocated in the heap
  17,214,458,128 bytes copied during GC
   2,787,092,160 bytes maximum residency (15 sample(s))
      43,219,264 bytes maximum slop
            5570 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause
  Gen  0     44504 colls, 44504 par   16.907s  10.804s     0.0002s    0.0014s
  Gen  1        15 colls,    14 par    4.006s   2.991s     0.1994s    1.2954s

  Parallel GC work balance: 33.60% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 1000001 (1000001 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   14.298s  ( 12.392s elapsed)
  GC      time   20.912s  ( 13.795s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time   35.211s  ( 26.190s elapsed)

  Alloc rate    4,133,806,996 bytes per MUT second

  Productivity  40.6% of total user, 47.3% of total elapsed

gc_alloc_block_sync: 2304055
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1105370

A few things I see here:

  • Much more memory used
  • Longer time
  • A lot of time spent in GC

What else I did

I experimented a bit:

  • Used parMap, parList, and a custom parListChunk' function to evaluate the list (sketched after this list) - every time the result was much worse than the sequential version
  • Used different chunk sizes - from very small ones like 5 up to half of the list's length - and every time the result was much worse than the sequential version
  • Changed the factors in the main function to something very big, e.g. x^123442, added more divisions instead of multiplications, etc., and also reduced the domain of the problem - all to get fewer sparks, each computationally more expensive. Here I got results similar to those of the sequential version (which ran in about 28 sec with those new powers) - the parallel run finished in 31 sec
  • Checked every run with ThreadScope to ensure two cores were used when expected - they were!
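
For reference, the variants from the first two bullets look roughly like this (parListChunk' is my own helper and not shown):

xs   = [ini, ini+dx .. fin]
lst1 = map f xs `using` parList rdeepseq          -- one spark per element
lst2 = parMap rdeepseq f xs                       -- same idea via parMap
lst3 = map f xs `using` parListChunk 5 rdeepseq   -- very small chunks
-- ... up to parListChunk (length xs `div` 2) rdeepseq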

Question

  1. As parallel performance improved with increasing computational cost per chunk (e.g. x^12345) and a decreasing number of chunks - is it the switching between chunks that kills performance when the factors are cheap to compute (e.g. x^4, x^3), making the sequential version faster? Is there a way to parallelize this successfully with better performance?
  2. Why does the parallel version use so much memory and GC time?
  3. How can I reduce the time spent in GC in the parallel version?
asked Oct 05 '18 by GrayCat


1 Answer

A strategy like parListChunk only makes sense when by far most of the computational cost lies in evaluating the individual list elements, compared to the overhead of building the list spine etc. in the first place. With integrateT over a simple polynomial, however, these elements are very cheap to compute, and most of the cost will be in list overhead. The only reason the sequential version still runs efficiently is that GHC can inline/fuse most of that business away - but apparently not in the parallelised version.
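
To see what that fusion buys, here is roughly the loop the sequential sum-over-map can compile down to - a hand-written sketch, not actual GHC output, and glossing over the exact cutoff semantics of [ini,ini+dx..fin]:

{-# LANGUAGE BangPatterns #-}

-- strict accumulator loop; no intermediate list is ever built
integrateT' :: (Fractional a, Ord a) => (a -> a) -> (a,a) -> a -> a
integrateT' f (ini, fin) dx = go ini 0
  where
    go !x !acc
      | x > fin   = acc * dx - 0.5 * (f ini + f fin) * dx
      | otherwise = go (x + dx) (acc + f x)

Once each element is sparked individually, this loop form is impossible: the full list of thunks has to be materialised on the heap, which is where the extra allocation and GC pressure come from.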

Solution: let each thread run a properly fusable sequential version, i.e. divide up the interval rather than its discretised list form, like this:

integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a 
integrateT f (ini, fin) dx 
  = let lst = map f [ini,ini+dx..fin]
    in sum lst * dx - 0.5 * (f ini + f fin) * dx

integrateT_par :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT_par f (l,r) dx
  = let chunks = [ integrateT f (l + i*wChunk, l + (i+1)*wChunk) dx
                 | i <- [0..nChunks-1] ]   -- one sub-interval per chunk
               `using` parList rdeepseq    -- one spark per chunk
    in sum chunks
 where nChunks = 100
       wChunk  = (r-l)/nChunks
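
A usage sketch, assuming the same polynomial and RTS flags as in the question:

main :: IO ()
main = print (integrateT_par (\x -> x^4 - x^3 + x^2 + x/13 + 1)
                             (0.0,1000000.0) 0.01 :: Double)
-- stack exec lab5 -- +RTS -N2 -s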

This will then not have significantly worse memory or performance than the sequential version.

Mind, as I test it now, it doesn't actually seem to run in parallel at all, but I'm pretty sure I just missed something in the setup...

answered Nov 15 '22 by leftaroundabout