
How do I optimise garbage collection for a soft real-time application in Haskell?

I've written a soft real-time application in Haskell which deals with simulated physics, collision detection, all that good stuff. In doing all that I allocate a lot of memory, and I could probably optimise my memory usage if I wanted to, but since I'm sitting nicely at 40% CPU and only 1% RAM used anyway, that doesn't seem necessary. What I am seeing, though, is that a lot of the time, when the garbage collector kicks in, frames are skipped. I've verified that this is the cause of the problem by profiling with ThreadScope: no useful computation happens for sometimes up to 0.05 seconds while the garbage collector does its business, resulting in up to 3 skipped frames, which is very noticeable and very annoying.

Now, I tried solving this by manually calling performMinorGC every frame, and this seems to alleviate the issue, making it much smoother, apart from the fact that overall CPU usage rises drastically to around 70%. Clearly I'd rather avoid this.
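
To be concrete, the manual-collection version of my loop looks roughly like this (stepSimulation and renderFrame are placeholders for the real physics and rendering code):

    import Control.Monad (forever)
    import System.Mem (performMinorGC)

    -- Placeholders for the real physics step and renderer.
    stepSimulation :: IO ()
    stepSimulation = return ()

    renderFrame :: IO ()
    renderFrame = return ()

    -- Frame loop that triggers a minor collection at a predictable point
    -- (end of frame) instead of letting the RTS pause us mid-frame.
    gameLoop :: IO ()
    gameLoop = forever $ do
      stepSimulation
      renderFrame
      performMinorGC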

Another thing I tried was decreasing the GC's allocation space from 512k down to 64k with -H64k, and I also tried setting -I0.03 to get it to collect more often. Both of these options changed the pattern of garbage collection that I saw in ThreadScope, but they still resulted in skipped frames.
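
For reference, these are run-time RTS flags; with the binary linked with -rtsopts they are passed like the following (the executable name here is just a placeholder):

    ./mygame +RTS -N4 -H64k -I0.03 -RTS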

Can anyone with some experience in GC optimisation help me out here? Am I doomed to calling performMinorGC manually and putting up with the resulting massive performance loss?

EDIT

I tried to run it for a similar amount of time in these tests, but since it's real-time there's no point at which it's 'done'.

Runtime statistics with performMinorGC every 4 frames:

     9,776,109,768 bytes allocated in the heap
     349,349,800 bytes copied during GC
      53,547,152 bytes maximum residency (14 sample(s))
      12,123,104 bytes maximum slop
             105 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     15536 colls, 15536 par    3.033s   0.997s     0.0001s    0.0192s
  Gen  1        14 colls,    13 par    0.207s   0.128s     0.0092s    0.0305s

  Parallel GC work balance: 6.15% (serial 0%, perfect 100%)

  TASKS: 20 (2 bound, 13 peak workers (18 total), using -N4)

  SPARKS: 74772 (20785 converted, 0 overflowed, 0 dud, 38422 GC'd, 15565 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    9.773s  (  7.368s elapsed)
  GC      time    3.240s  (  1.126s elapsed)
  EXIT    time    0.003s  (  0.004s elapsed)
  Total   time   13.040s  (  8.499s elapsed)

  Alloc rate    1,000,283,400 bytes per MUT second

  Productivity  75.2% of total user, 115.3% of total elapsed

gc_alloc_block_sync: 29843
whitehole_spin: 0
gen[0].sync: 11
gen[1].sync: 71

Runtime statistics with no performMinorGC:

  12,316,488,144 bytes allocated in the heap
     447,495,936 bytes copied during GC
      63,556,272 bytes maximum residency (15 sample(s))
      15,418,296 bytes maximum slop
             146 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     19292 colls, 19292 par    2.613s   0.950s     0.0000s    0.0161s
  Gen  1        15 colls,    14 par    0.237s   0.165s     0.0110s    0.0499s

  Parallel GC work balance: 2.67% (serial 0%, perfect 100%)

  TASKS: 17 (2 bound, 13 peak workers (15 total), using -N4)

  SPARKS: 100714 (29688 converted, 0 overflowed, 0 dud, 47577 GC'd, 23449 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time   13.377s  (  9.917s elapsed)
  GC      time    2.850s  (  1.115s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time   16.247s  ( 11.039s elapsed)

  Alloc rate    920,744,995 bytes per MUT second

  Productivity  82.5% of total user, 121.4% of total elapsed

gc_alloc_block_sync: 68533
whitehole_spin: 0
gen[0].sync: 9
gen[1].sync: 147

For some reason, the overall productivity with no performMinorGC seems to be lower now than when I tested it yesterday; before it was always >90%.

Bradley Hardy asked Nov 05 '15


1 Answer

You have a very large old generation, around 100 MB.

By default GHC performs a major GC when the heap size reaches 2x its size after the last major GC. That means at some point the GC had to scan and copy about 50 MB of data. If your processor has a memory throughput limit of 10 GB/s, then loading and copying 50 MB will take at least 0.01 s (compare with the gen1 average and max pauses).
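
As a rough back-of-the-envelope check of that figure (taking the 10 GB/s throughput above as the assumption):

    copying 50 MB of live data  ≈  50 MB read + 50 MB write  =  100 MB of memory traffic
    100 MB / (10 GB per second) ≈  0.01 s per major collection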

(I'm assuming you checked the eventlog to make sure a major GC is actually running during the 0.05 s pause, so it is not an issue with thread synchronization, where the GC is waiting for other threads instead of doing real work.)

So to minimize GC pauses, you should ensure the old generation stays small. If most of that 50 MB is static data allocated at the very beginning and live until the end (e.g. textures or meshes), then you are stuck. The only workaround I know is to pack the data into e.g. a storable vector and unpack parts of it again when you need them.
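
A minimal sketch of that workaround, assuming the vector package (the function names are made up for illustration): the data lives in one flat, pointer-free buffer, so the GC never traverses or copies the individual elements, only the single vector object.

    import qualified Data.Vector.Storable as VS

    -- Pack the static data into one flat Storable buffer once, up front.
    packVertices :: [Float] -> VS.Vector Float
    packVertices = VS.fromList

    -- Unpack only the slice that the current frame actually needs.
    verticesForChunk :: Int -> Int -> VS.Vector Float -> [Float]
    verticesForChunk offset len = VS.toList . VS.slice offset len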

If the data is allocated during execution and lives for a limited amount of time (but long enough to survive a few major collections), then try to rethink your pipeline. Usually no data should survive more than one frame, so you are probably doing something wrong, e.g. retaining data when you should not.

Another bad sign is the gen0 max pause of 0.02 s, which is pretty strange: by default the gen0 allocation area is 0.5 MB, so gen0 GC should be fast. Probably you have a large remembered set. Possible causes: mutable structures (IORef, mutable Vector, etc.) or a lot of lazy thunk updates.
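
For the lazy-thunk case, a minimal illustration, assuming some per-frame counter held in an IORef (the function names are made up):

    import Data.IORef (IORef, modifyIORef, modifyIORef')

    -- The lazy modifyIORef builds a chain of unevaluated (+1) thunks; every
    -- write to an IORef living in the old generation also adds
    -- remembered-set work for the next minor GC.
    badCounter :: IORef Int -> IO ()
    badCounter ref = modifyIORef ref (+ 1)

    -- The strict variant forces the new value before writing it, so the old
    -- generation holds a plain Int instead of a thunk chain.
    goodCounter :: IORef Int -> IO ()
    goodCounter ref = modifyIORef' ref (+ 1)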

And a minor (probably unrelated) issue: it looks like you are using implicit parallelism, but only about 1/3 of the sparks are converted. You are creating too many sparks; about half of them are GC'd.
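
One common way to improve the conversion rate is to spark fewer, larger chunks whose results are actually demanded. A sketch using Control.Parallel.Strategies (the chunk size of 64 and the NFData constraint are assumptions on my part, not something taken from your code):

    import Control.Parallel.Strategies (parListChunk, rdeepseq, using)
    import Control.DeepSeq (NFData)

    -- Step all bodies in parallel, one spark per chunk of 64 elements.
    -- Sparks whose results are never demanded get GC'd; sparks the main
    -- thread evaluates first fizzle; forcing whole chunks with rdeepseq
    -- keeps each spark worth converting.
    stepBodies :: NFData body => (body -> body) -> [body] -> [body]
    stepBodies step bodies = map step bodies `using` parListChunk 64 rdeepseq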

Yuras answered Sep 27 '22