Profiling Two Functions That Sum a Large List

I just started reading Parallel and Concurrent Programming in Haskell.

I wrote a program that, I believe, sums up a list in two ways:

  • running rpar (force (sum list))
  • splitting the list in half, running the above on each half, and adding the results

Here's the code:

import Control.Parallel.Strategies
import Control.DeepSeq
import System.Environment

main :: IO ()
main = do
  [n] <- getArgs
  -- pick single or faster based on the command-line argument (1 or 2)
  [single, faster] !! (read n - 1)

-- sum the whole list in one spark
single :: IO ()
single = print . runEval $ rpar (force (sum list))

-- split the list in half and sum each half in its own spark
faster :: IO ()
faster = print . runEval $ do
    let (as, bs) = splitAt (length list `div` 2) list
    res1 <- rpar (force (sum as))
    res2 <- rpar (force (sum bs))
    return (res1 + res2)

list :: [Integer]
list = [1..10000000]

Compile with parallelism enabled (-threaded):

C:\Users\k\Workspace\parallel_concurrent_haskell>ghc Sum.hs -O2 -threaded -rtsopts
[1 of 1] Compiling Main             ( Sum.hs, Sum.o )
Linking Sum.exe ...

Results of the single program

C:\Users\k\Workspace\parallel_concurrent_haskell>Sum 1 +RTS -s -N2
50000005000000
     960,065,896 bytes allocated in the heap
         363,696 bytes copied during GC
          43,832 bytes maximum residency (2 sample(s))
          57,016 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1837 colls,  1837 par    0.00s    0.01s     0.0000s    0.0007s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0002s    0.0003s

  Parallel GC work balance: 0.18% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 1 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.27s  (  0.27s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.27s  (  0.28s elapsed)

  Alloc rate    3,614,365,726 bytes per MUT second

  Productivity 100.0% of total user, 95.1% of total elapsed

gc_alloc_block_sync: 573
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

Results of the faster program

C:\Users\k\Workspace\parallel_concurrent_haskell>Sum 2 +RTS -s -N2
50000005000000
   1,600,100,336 bytes allocated in the heap
   1,477,564,464 bytes copied during GC
     400,027,984 bytes maximum residency (14 sample(s))
      70,377,336 bytes maximum slop
             911 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3067 colls,  3067 par    1.05s    0.68s     0.0002s    0.0021s
  Gen  1        14 colls,    13 par    1.98s    1.53s     0.1093s    0.5271s

  Parallel GC work balance: 0.00% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 2 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 1 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.38s  (  1.74s elapsed)
  GC      time    3.03s  (  2.21s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    3.42s  (  3.95s elapsed)

  Alloc rate    4,266,934,229 bytes per MUT second

  Productivity  11.4% of total user, 9.9% of total elapsed

gc_alloc_block_sync: 335
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

Why did single complete in 0.28 seconds, while faster (poorly named, evidently) took 3.95 seconds?

asked Mar 16 '23 by Kevin Meredith
1 Answer

I am no expert in Haskell-specific profiling, but I can see several possible problems in faster. You are walking the input list at least three times: once to get its length, once for splitAt (maybe twice; I'm not totally sure how it is implemented), and then again to read and sum its elements. In single, the list is walked only once.
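For example, here is a minimal sketch (my own illustration, not from the question; the name fasterRanges and the hard-coded bounds are invented) that avoids the length and splitAt traversals entirely by generating each half as its own range:

import Control.Parallel.Strategies (runEval, rpar, rseq)

-- Sketch: build each half as an independent range, so no shared list
-- is ever measured, split, or retained. Bounds are hard-coded here.
fasterRanges :: IO ()
fasterRanges = print . runEval $ do
  let half = 10000000 `div` 2
  res1 <- rpar (sum [1 .. half])             -- first half, generated on demand
  res2 <- rpar (sum [half + 1 .. 10000000])  -- second half, likewise
  _ <- rseq res1   -- wait for the results here...
  _ <- rseq res2   -- ...so the sparks have a chance to run in parallel
  return (res1 + res2)

Since nothing refers to a shared list, there is no spine to copy or retain, and the rseq calls keep the main thread from immediately demanding both results itself, which is what makes sparks fizzle.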

You also hold the entire list in memory at once with faster, but with single, Haskell can process it lazily and collect garbage as it goes. If you look at the profiling output, you can see that faster copies many more bytes during GC: over 3,000 times more! faster also needed 400 MB of memory all at once, where single needed only about 40 KB at a time. So the garbage collector had a much larger space to keep scanning over.
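If you want to keep single's process-and-collect-as-you-go behaviour while still splitting the work, one option (a sketch under my own assumptions; chunkedSum and the chunk size are invented for illustration) is to spark over small (lo, hi) descriptors so each spark generates, sums, and discards its own numbers:

import Control.Parallel.Strategies (parMap, rseq)

-- Each (lo, hi) pair is tiny; the numbers in a chunk are generated
-- inside its spark and become garbage as soon as that chunk is summed.
chunkedSum :: Integer -> Integer -> Integer -> Integer
chunkedSum lo hi chunk =
  sum (parMap rseq (\(a, b) -> sum [a .. b]) bounds)
  where
    bounds = [ (a, min hi (a + chunk - 1)) | a <- [lo, lo + chunk .. hi] ]

main :: IO ()
main = print (chunkedSum 1 10000000 1000000)  -- ten chunks of a million

Each chunk lives only as long as its own sum, so maximum residency stays small no matter how large the overall range is.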

Another big issue: you allocate a ton of new cons cells in faster, to hold the two intermediate sub-lists. Even if they could all be GC'd right away, this is a lot of time spent allocating; it's more expensive than just doing the addition to begin with! So even before you start adding, you are already "over budget" compared to single.
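As a yardstick (again my own sketch, not from the thread; sumTo is an invented name), summing with a strict accumulator builds no list at all, so any splitting scheme has to recoup its allocation cost before parallelism can pay off:

{-# LANGUAGE BangPatterns #-}

-- No list is ever built: just a strict accumulator and a counter,
-- so there is nothing for the garbage collector to chase.
sumTo :: Integer -> Integer
sumTo n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 10000000)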

answered Mar 23 '23 by amalloy