How to reduce Haskell's parallelization overhead?

I am trying to understand Haskell's parallelization performance.

I have a long list (length >1000) that I am evaluating in parallel, using parallel's parMap.
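The actual work function is more involved, but structurally the code amounts to something like the following sketch; `costly` and the input range are just stand-ins for the real per-element work:

        import Control.Parallel.Strategies (parMap, rdeepseq)

        -- stand-in for the real per-element computation
        costly :: Int -> Int
        costly n = sum [i * i | i <- [1 .. n]]

        main :: IO ()
        main = print (sum (parMap rdeepseq costly [1000 .. 8000]))

The program is compiled with -threaded and -rtsopts and run with different +RTS -N settings.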

Here is the full stats output (+RTS -s) for a single-threaded run:

        54,248,802,288 bytes allocated in the heap
           324,451,424 bytes copied during GC
             2,970,272 bytes maximum residency (4 sample(s))
                52,064 bytes maximum slop
                   217 MB total memory in use (1 MB lost due to fragmentation)

                                          Tot time (elapsed)  Avg pause  Max pause
        Gen  0       251 colls,     0 par    1.45s    1.49s     0.0059s    0.0290s
        Gen  1         4 colls,     0 par    0.03s    0.05s     0.0125s    0.0319s

        TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

        SPARKS: 6688 (0 converted, 0 overflowed, 0 dud, 1439 GC'd, 5249 fizzled)

        INIT    time    0.00s  (  0.03s elapsed)
        MUT     time   19.76s  ( 20.20s elapsed)
        GC      time    1.48s  (  1.54s elapsed)
        EXIT    time    0.00s  (  0.00s elapsed)
        Total   time   21.25s  ( 21.78s elapsed)

        Alloc rate    2,745,509,084 bytes per MUT second

        Productivity  93.0% of total user, 90.8% of total elapsed

      gc_alloc_block_sync: 0
      whitehole_spin: 0
      gen[0].sync: 0
      gen[1].sync: 0

If I run on two threads, using +RTS -N2, I get:

        54,336,738,680 bytes allocated in the heap
           346,562,320 bytes copied during GC
             5,437,368 bytes maximum residency (5 sample(s))
               120,000 bytes maximum slop
                   432 MB total memory in use (0 MB lost due to fragmentation)

                                          Tot time (elapsed)  Avg pause  Max pause
        Gen  0       127 colls,   127 par    2.07s    0.99s     0.0078s    0.0265s
        Gen  1         5 colls,     4 par    0.08s    0.04s     0.0080s    0.0118s

        Parallel GC work balance: 41.39% (serial 0%, perfect 100%)

        TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

        SPARKS: 6688 (6628 converted, 0 overflowed, 0 dud, 0 GC'd, 60 fizzled)

        INIT    time    0.00s  (  0.01s elapsed)
        MUT     time   25.31s  ( 13.35s elapsed)
        GC      time    2.15s  (  1.03s elapsed)
        EXIT    time    0.01s  (  0.01s elapsed)
        Total   time   27.48s  ( 14.40s elapsed)

        Alloc rate    2,146,509,982 bytes per MUT second

        Productivity  92.2% of total user, 175.9% of total elapsed

      gc_alloc_block_sync: 19922
      whitehole_spin: 0
      gen[0].sync: 1
      gen[1].sync: 0

and on four threads:

        54,307,370,096 bytes allocated in the heap
           367,282,056 bytes copied during GC
             8,561,960 bytes maximum residency (6 sample(s))
             3,885,784 bytes maximum slop
                   860 MB total memory in use (0 MB lost due to fragmentation)

                                          Tot time (elapsed)  Avg pause  Max pause
        Gen  0        62 colls,    62 par    2.45s    0.70s     0.0113s    0.0179s
        Gen  1         6 colls,     5 par    0.20s    0.07s     0.0112s    0.0146s

        Parallel GC work balance: 40.57% (serial 0%, perfect 100%)

        TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

        SPARKS: 6688 (6621 converted, 0 overflowed, 0 dud, 3 GC'd, 64 fizzled)

        INIT    time    0.01s  (  0.01s elapsed)
        MUT     time   37.26s  ( 10.95s elapsed)
        GC      time    2.65s  (  0.77s elapsed)
        EXIT    time    0.01s  (  0.01s elapsed)
        Total   time   39.94s  ( 11.76s elapsed)

        Alloc rate    1,457,427,453 bytes per MUT second

        Productivity  93.4% of total user, 317.2% of total elapsed

      gc_alloc_block_sync: 23494
      whitehole_spin: 0
      gen[0].sync: 10527
      gen[1].sync: 38

So, going by the elapsed times (the last number in each output), with two cores the program takes ~66% of the single-threaded time (14.40 s vs. 21.78 s), and with four cores ~54% (11.76 s vs. 21.78 s). That speedup is not too bad, but it is far worse than the theoretically expected linear improvement with the number of cores, which would mean ~25% of the runtime on four cores.

Now, looking at the statistics above, the actual CPU time the program spends doing work (the rows starting with MUT) increases markedly as more cores are used: 19.76 s, 25.31 s and 37.26 s with 1, 2 and 4 cores respectively. I believe this increase is what is eating up my parallelization gains.

The typical reasons I can think of for such a CPU-time overhead when using multiple cores are:

  • Too fine a granularity of workload distribution. However, I tried the same program with parListChunk from the parallel package and a chunk size of 10 (see the sketch after this list), and the result was very similar, so at the moment I do not think the overhead is due to too fine a granularity.
  • Garbage collection: this was a big performance killer for my code in the past, but since I increased the GC allocation area (nursery) to 100 MB, the total time spent in GC is quite small, as seen in the stats above.
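
For reference, the chunked variant looks roughly like this. It is again only a sketch with the same stand-in work function, and it assumes that the 100 MB GC setting mentioned above corresponds to the RTS flag -A100m:

        import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

        -- same stand-in work function as in the sketch above
        costly :: Int -> Int
        costly n = sum [i * i | i <- [1 .. n]]

        main :: IO ()
        main = print (sum (map costly [1000 .. 8000] `using` parListChunk 10 rdeepseq))

        -- run e.g. as:  ./prog +RTS -N4 -A100m -s

With parListChunk each chunk of 10 elements becomes one spark, so the spark count drops to roughly one per chunk instead of one per element.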

What other reasons could there be for such a large overhead, and how can I mitigate them?

asked Mar 11 '15 by Stephan

1 Answer

I see people are voting to close the question because there are not enough details, but I believe the answer can be found from the information already provided (though more details are always welcome).

My nose tells me that you are limited by memory throughput. I'll try to describe why I think so, but I'm not a hardware expert, so I could be partially or totally wrong. After all, it is based on my personal set of hardware architecture myths.

Let's assume that the limit is somewhere between 50 and 100 GB per second (I'm not sure that is the correct number, please correct me if you have a better one).

You are allocating 54 GB in roughly 11 seconds of mutator time (the -N4 case), so you have about 5 GB/s of allocation throughput. That is pretty high, but usually it is not an issue on its own.

Most allocations are usually short-lived, and they are GC'd once the gen-0 allocation area (the nursery) fills up. By default the nursery size is 512 KB, so all allocation happens inside the L2 cache. Short-lived data therefore never reaches main memory, which is why it is almost free.

But you increased the nursery size to 100 MB. That will not fit in the L2 cache, so it will be transferred to main memory. That is already a bad sign.

Well, 5 GB/s is still far from the limit. But there is a reason you increased the nursery size: your data is not short-lived, it gets used somewhere else after some delay. That means those 54 GB will be loaded from main memory back into the caches sooner or later, so you have at least 10 GB/s of memory traffic (54 GB written plus 54 GB re-read over roughly 11 seconds).

That is still below the limit, but note that it is the best-case scenario: a sequential memory-access pattern. In reality you are accessing memory in a fairly random order, so the same cache lines are loaded and evicted multiple times, and you can easily reach 100 GB/s.

To fix the issue, you should identify why your data is not short-lived and try to change that. If that is not possible, you can try to increase data locality and make the memory-access pattern more sequential.
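
For example (a minimal sketch, assuming your per-element data can be stored unboxed, which may or may not hold for your real program), keeping the inputs in a Data.Vector.Unboxed vector stores them contiguously, so a traversal touches memory sequentially instead of chasing list pointers:

        import qualified Data.Vector.Unboxed as U

        -- stand-in for the real per-element computation
        costly :: Int -> Int
        costly n = sum [i * i | i <- [1 .. n]]

        main :: IO ()
        main = do
          -- an unboxed vector keeps its Ints contiguous in memory, with no
          -- per-element heap objects, so the traversal is cache-friendly;
          -- with vector's stream fusion the map/sum pipeline typically
          -- avoids building an intermediate vector at all
          let inputs = U.enumFromTo 1000 8000 :: U.Vector Int
          print (U.sum (U.map costly inputs))

Parallelism can be layered back on top of that (e.g. by splitting the vector into slices and sparking each slice), but the point here is only the memory layout.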

I'd like to know what hardware experts think about my naive explanation :)

answered by Yuras