Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reducing garbage-collection pause time in a Haskell program

People also ask

What is garbage collector pause time?

Metronome garbage collector (GC) pause time can be fine-tuned for each Java™ process. By default, the Metronome GC pauses for 3 milliseconds in each individual pause, which is known as a quantum.

Does Haskell use garbage collection?

The Haskell runtime system employs a generational garbage collector (GC) with two generations 2. Generations are numbered starting with the youngest generation at zero. Values are always allocated in a special part of the youngest generation called the nursery.


You're actually doing pretty well to have a 51ms pause time with over 200Mb of live data. The system I work on has a larger max pause time with half that amount of live data.

Your assumption is correct, the major GC pause time is directly proportional to the amount of live data, and unfortunately there's no way around that with GHC as it stands. We experimented with incremental GC in the past, but it was a research project and didn't reach the level of maturity needed to fold it into the released GHC.

One thing that we're hoping will help with this in the future is compact regions: https://phabricator.haskell.org/D1264. It's a kind of manual memory management where you compact a structure in the heap, and the GC doesn't have to traverse it. It works best for long-lived data, but perhaps it will be good enough to use for individual messages in your setting. We're aiming to have it in GHC 8.2.0.

If you're in a distributed setting and have a load-balancer of some kind there are tricks you can play to avoid taking the pause hit, you basically make sure that the load-balancer doesn't send requests to machines that are about to do a major GC, and of course make sure that the machine still completes the GC even though it isn't getting requests.


As mentioned in other answers, the garbage collector in GHC traverses live data, which means the more long-living data you store in memory, the longer GC pauses will be.

GHC 8.2

To overcome this problem partially, a feature called compact regions was introduced in GHC-8.2. It is both a feature of the GHC runtime system and a library that exposes convenient interface to work with. The compact regions feature allows putting your data into a separate place in memory and GC won't traverse it during the garbage collecting phase. So if you have a large structure you want to keep in memory, consider using compact regions. However, the compact region itself doesn't have mini garbage collector inside, it works better for append-only data structures, not something like HashMap where you also want to delete stuff. Though you can overcome this problem. For details refer to the following blog post:

  • Using compact regions (by Alexander Vershilov)

GHC 8.10

Moreover, since GHC-8.10 a new low-latency incremental garbage collector algorithm is implemented. It's an alternative GC algorithm which is not enabled by default but you can opt-in to it if you want. So you can switch the default GC to a newer one to automatically get features provided by compact regions without the need to do manual wrapping and unwrapping. However, the new GC is not a silver bullet and doesn't solve all the problems automagically, and it has its trade-offs. For benchmarks of the new GC refer to the following GitHub repository:

  • Low-latency incremental GC benchmarks

I've tried your code snippet with a ringbuffer approach using IOVector as the underlying data structure. On my system (GHC 7.10.3, same compilation options) this resulted in a reduction of max time (the metric you mentioned in your OP) by ~22%.

NB. I made two assumptions here:

  1. A mutable data structure is an okay fit for the problem (I guess message passing implies IO anyhow)
  2. Your messageId's are continuous

With some additional Int parameter and arithmetic (like when messageId's are reset back to 0 or minBound) it should then be straightforward to determine whether a certain message is still in the history and retrieve it form the corresponding index in the ringbuffer.

For your testing pleasure:

import qualified Control.Exception as Exception
import qualified Control.Monad as Monad
import qualified Data.ByteString as ByteString
import qualified Data.Map.Strict as Map

import qualified Data.Vector.Mutable as Vector

data Msg = Msg !Int !ByteString.ByteString

type Chan = Map.Map Int ByteString.ByteString

data Chan2 = Chan2
    { next          :: !Int
    , maxId         :: !Int
    , ringBuffer    :: !(Vector.IOVector ByteString.ByteString)
    }

chanSize :: Int
chanSize = 200000

message :: Int -> Msg
message n = Msg n (ByteString.replicate 1024 (fromIntegral n))


newChan2 :: IO Chan2
newChan2 = Chan2 0 0 <$> Vector.unsafeNew chanSize

pushMsg2 :: Chan2 -> Msg -> IO Chan2
pushMsg2 (Chan2 ix _ store) (Msg msgId msgContent) =
    let ix' = if ix == chanSize then 0 else ix + 1
    in Vector.unsafeWrite store ix' msgContent >> return (Chan2 ix' msgId store)

pushMsg :: Chan -> Msg -> IO Chan
pushMsg chan (Msg msgId msgContent) =
  Exception.evaluate $
    let
      inserted = Map.insert msgId msgContent chan
    in
      if chanSize < Map.size inserted
      then Map.deleteMin inserted
      else inserted

main, main1, main2 :: IO ()

main = main2

main1 = Monad.foldM_ pushMsg Map.empty (map message [1..1000000])

main2 = newChan2 >>= \c -> Monad.foldM_ pushMsg2 c (map message [1..1000000])

I have to agree with the others - if you have hard real-time constraints, then using a GC language is not ideal.

However, you might consider experimenting with other available data structures rather than just Data.Map.

I rewrote it using Data.Sequence and got some promising improvements:

msgs history length  max GC pause (ms)
===================  =================
12500                              0.7
25000                              1.4
50000                              2.8
100000                             5.4
200000                            10.9
400000                            21.8
800000                            46
1600000                           87
3200000                          175
6400000                          350

Even though you're optimising for latency, I noticed other metrics improving too. In the 200000 case, execution time drops from 1.5s to 0.2s, and total memory usage drops from 600MB to 27MB.

I should note that I cheated by tweaking the design:

  • I removed the Int from the Msg, so it's not in two places.
  • Instead of using a Map from Ints to ByteStrings, I used a Sequence of ByteStrings, and instead of one Int per message, I think it can be done with one Int for the whole Sequence. Assuming messages can't get reordered, you can use a single offset to translate which message you want to where it sits in the queue.

(I included an additional function getMsg to demonstrate that.)

{-# LANGUAGE BangPatterns #-}

import qualified Control.Exception as Exception
import qualified Control.Monad as Monad
import qualified Data.ByteString as ByteString
import Data.Sequence as S

newtype Msg = Msg ByteString.ByteString

data Chan = Chan Int (Seq ByteString.ByteString)

message :: Int -> Msg
message n = Msg (ByteString.replicate 1024 (fromIntegral n))

maxSize :: Int
maxSize = 200000

pushMsg :: Chan -> Msg -> IO Chan
pushMsg (Chan !offset sq) (Msg msgContent) =
    Exception.evaluate $
        let newSize = 1 + S.length sq
            newSq = sq |> msgContent
        in
        if newSize <= maxSize
            then Chan offset newSq
            else
                case S.viewl newSq of
                    (_ :< newSq') -> Chan (offset+1) newSq'
                    S.EmptyL -> error "Can't happen"

getMsg :: Chan -> Int -> Maybe Msg
getMsg (Chan offset sq) i_ = getMsg' (i_ - offset)
    where
    getMsg' i
        | i < 0            = Nothing
        | i >= S.length sq = Nothing
        | otherwise        = Just (Msg (S.index sq i))

main :: IO ()
main = Monad.foldM_ pushMsg (Chan 0 S.empty) (map message [1..5 * maxSize])

Well you found the limitation of languages with GC: They aren't fit for hardcore real-time systems.

You have 2 options:

1st Increase heap size and use a 2 level caching system, oldest messages are sent to disk and you keep newest messages on memory, you can do this by using OS paging. The problem, though with this solution is that paging can be expensive depending on the reading capabilities of the secondary memory unit used.

2nd Program that solution using 'C' and interface it with FFI to haskell. That way you can do your own memory management. This would be the best option as you can control the memory you need by yourself.