I've written a soft real-time application in Haskell, which deals with simulated physics, collision detection, all that good stuff. In doing all that I allocate a lot of memory, and I could probably optimise my memory usage if I wanted to, but since I'm sitting nicely at 40% CPU and only 1% RAM used anyway, that doesn't seem necessary. What I am seeing, though, is that a lot of the time, when the garbage collector kicks in, frames are skipped. I've verified that this is the cause of the problem by profiling with threadscope: no useful computation happens for sometimes up to 0.05 seconds while the garbage collector does its business, resulting in up to 3 skipped frames, which is very noticeable and very annoying.
Now, I tried solving this by manually calling performMinorGC every frame, and this seems to alleviate the issue, making it much smoother, apart from the fact that overall CPU usage goes drastically up to around 70%. Clearly I'd rather avoid this.
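For concreteness, a minimal sketch of what "calling performMinorGC every frame" looks like; worldStep and renderFrame are hypothetical stand-ins for the application's own per-frame logic:

```haskell
import Control.Monad (forever)
import System.Mem (performMinorGC)

gameLoop :: IO ()
gameLoop = forever $ do
  worldStep      -- simulate physics, collision detection, ...
  renderFrame    -- draw the frame
  performMinorGC -- collect the nursery now, at a predictable point,
                 -- instead of letting a pause land mid-frame

-- Stubs so the sketch is self-contained:
worldStep, renderFrame :: IO ()
worldStep   = pure ()
renderFrame = pure ()
```

The idea is that the pause still happens, but at a moment of the frame where there is slack time, rather than wherever the allocation limit happens to be hit.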
Another thing I tried was decreasing the GC's allocation space down to 64k from 512k with -H64k, and I also tried setting -I0.03 to try to get it to collect more often. Both of these options changed the pattern of garbage collection that I saw in threadscope, but they still resulted in skipped frames.
Can anyone with some experience in GC optimisation help me out here? Am I doomed to calling performMinorGC manually and putting up with the resulting massive performance loss?
EDIT
I tried to run it for a similar amount of time in these tests, but since it's real-time there's no point at which it's 'done'.
Runtime statistics with performMinorGC every 4 frames:
9,776,109,768 bytes allocated in the heap
349,349,800 bytes copied during GC
53,547,152 bytes maximum residency (14 sample(s))
12,123,104 bytes maximum slop
105 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 15536 colls, 15536 par 3.033s 0.997s 0.0001s 0.0192s
Gen 1 14 colls, 13 par 0.207s 0.128s 0.0092s 0.0305s
Parallel GC work balance: 6.15% (serial 0%, perfect 100%)
TASKS: 20 (2 bound, 13 peak workers (18 total), using -N4)
SPARKS: 74772 (20785 converted, 0 overflowed, 0 dud, 38422 GC'd, 15565 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 9.773s ( 7.368s elapsed)
GC time 3.240s ( 1.126s elapsed)
EXIT time 0.003s ( 0.004s elapsed)
Total time 13.040s ( 8.499s elapsed)
Alloc rate 1,000,283,400 bytes per MUT second
Productivity 75.2% of total user, 115.3% of total elapsed
gc_alloc_block_sync: 29843
whitehole_spin: 0
gen[0].sync: 11
gen[1].sync: 71
With no performMinorGC:
12,316,488,144 bytes allocated in the heap
447,495,936 bytes copied during GC
63,556,272 bytes maximum residency (15 sample(s))
15,418,296 bytes maximum slop
146 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 19292 colls, 19292 par 2.613s 0.950s 0.0000s 0.0161s
Gen 1 15 colls, 14 par 0.237s 0.165s 0.0110s 0.0499s
Parallel GC work balance: 2.67% (serial 0%, perfect 100%)
TASKS: 17 (2 bound, 13 peak workers (15 total), using -N4)
SPARKS: 100714 (29688 converted, 0 overflowed, 0 dud, 47577 GC'd, 23449 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 13.377s ( 9.917s elapsed)
GC time 2.850s ( 1.115s elapsed)
EXIT time 0.000s ( 0.006s elapsed)
Total time 16.247s ( 11.039s elapsed)
Alloc rate 920,744,995 bytes per MUT second
Productivity 82.5% of total user, 121.4% of total elapsed
gc_alloc_block_sync: 68533
whitehole_spin: 0
gen[0].sync: 9
gen[1].sync: 147
The overall productivity seems to be lower for no performMinorGC now than when I tested it yesterday, for some reason -- before it was always >90%.
Short of avoiding garbage collection altogether, there is only one way to make garbage collection faster: ensure that as few objects as possible are reachable during the garbage collection. The fewer objects that are alive, the less there is to be marked. This is the rationale behind the generational heap.
The Haskell runtime system employs a generational garbage collector (GC) with two generations. Generations are numbered starting with the youngest generation at zero. Values are always allocated in a special part of the youngest generation called the nursery.
An application that spends 1% of its execution time on garbage collection can lose more than 20% throughput on a 32-processor system. If we increase the GC time to 2%, the overall throughput will drop by another 20%. Such is the impact of suspending 32 executing threads simultaneously!
Haskell computations produce a lot of memory garbage -- much more than conventional imperative languages. This is because data is immutable, so the only way to store each operation's result is to create new values.
You have a very large old generation: around 100 MB.
By default, GHC performs a major GC when the heap size reaches 2x of its size after the last major GC. That means that at some point the GC had to scan and copy 50 MB of data. If your processor has a memory throughput limit of 10 GB/s, then loading and copying 50 MB will take at least 0.01 sec (compare with the gen1 average and max pause).
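The arithmetic behind that estimate, as a back-of-the-envelope Haskell sketch (the 50 MB and 10 GB/s figures are the assumptions from the paragraph above):

```haskell
-- ~50 MB live in the old generation, ~10 GB/s memory throughput.
liveBytes, bandwidth :: Double
liveBytes = 50 * 1024 * 1024
bandwidth = 10 * 1024 ^ (3 :: Int)

-- A copying collection must at least read and write the live data,
-- hence the factor of 2; this comes out to roughly 0.01 s.
pauseLowerBound :: Double
pauseLowerBound = 2 * liveBytes / bandwidth
```

This is only a lower bound: cache misses and pointer chasing during the scan usually make the real pause longer.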
(I'm assuming that you checked the eventlog to ensure a major GC is actually running during the 0.05 sec pause, so it is not an issue with thread synchronization, where the GC waits for other threads instead of doing real work.)
So, to minimize GC pauses, you should keep the old generation small. If most of this 50 MB is static data allocated at the very beginning that lives until the end (e.g. textures or meshes), then you are stuck. The only workaround I know is to pack the data into e.g. a storable vector and unpack parts of it when you need them.
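A sketch of that workaround, with hypothetical mesh data flattened into Floats: a Storable vector is backed by a single flat buffer that the GC does not traverse element by element, so major collections no longer pay for the payload.

```haskell
import qualified Data.Vector.Storable as VS

-- Flattened x/y/z vertex components; illustration only.
type Component = Float

-- Pack the boxed list into one unboxed buffer, once, at load time.
packMesh :: [Float] -> VS.Vector Component
packMesh = VS.fromList

-- Unpack just the part you need, when you need it
-- (here: triangle i, i.e. 3 vertices * 3 components).
triangle :: VS.Vector Component -> Int -> [Component]
triangle mesh i = VS.toList (VS.slice (9 * i) 9 mesh)
```

The trade-off is that you re-box small pieces on demand instead of keeping millions of long-lived boxed values alive across every major GC.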
If the data is allocated during execution and lives a limited amount of time (but long enough to survive a few major collections), then try to rethink your pipeline. Usually no data should survive one frame, so you are probably doing something wrong, e.g. retaining data when you should not.
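One common way data accidentally survives a frame is through lazy thunks that keep a reference to the previous world state. A hedged sketch of forcing each frame's result to normal form before the frame ends (stepFrame is a hypothetical wrapper, not part of the poster's code):

```haskell
import Control.DeepSeq (NFData, force)

-- Fully evaluate the new world before returning it, so no thunk in it
-- still points at (and thereby retains) the previous frame's world.
stepFrame :: NFData world => (world -> world) -> world -> world
stepFrame step w = force (step w)
```

With this pattern the old world becomes garbage immediately and dies young in the nursery, instead of being dragged into the old generation by an unevaluated closure.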
Another bad sign: a gen0 max pause of 0.02 sec. That is pretty strange. By default the gen0 allocation area is 0.5 MB, so gen0 GC should be fast. Probably you have a large remembered set. Possible causes: mutable structures (IORef, mutable Vector, etc.) or a lot of lazy thunk updates.
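A small example of the thunk-update problem with IORef, and the usual fix (the counter is illustrative; any frequently updated mutable cell behaves the same):

```haskell
import Data.IORef

-- Lazy: each call stores an unevaluated (+1) thunk chained onto the
-- previous one; the GC must track these old-to-new pointers in the
-- remembered set and walk the chain later.
tickLazy :: IORef Int -> IO ()
tickLazy counter = modifyIORef counter (+ 1)

-- Strict: the new value is evaluated before it is written, so nothing
-- lingers for the collector.
tickStrict :: IORef Int -> IO ()
tickStrict counter = modifyIORef' counter (+ 1)
```

The same distinction applies to strict vs. lazy writes into mutable vectors and record fields: prefer evaluating before storing.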
And a minor (probably unrelated) issue: it looks like you are using implicit parallelism, but only 1/3 of the sparks are converted. You are creating too many sparks; half of them are GC'd.
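A common remedy, sketched here with the parallel package, is to coarsen spark granularity: spark one chunk of work rather than one spark per element, so fewer sparks fizzle or get GC'd before a worker picks them up (the chunk size and expensiveStep are illustrative assumptions):

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- One spark per 1024 elements instead of one per element.
stepAll :: [Double] -> [Double]
stepAll xs = map expensiveStep xs `using` parListChunk 1024 rdeepseq
  where
    expensiveStep = sqrt . abs  -- stand-in for real per-body work
```

Tune the chunk size so the number of sparks is a small multiple of your capability count (-N4 here), and the converted/fizzled ratio in +RTS -s should improve.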