I'm having a bit of trouble figuring out how to reduce memory usage and GC time in a simulation running in the State
monad. Presently I have to run the compiled code with +RTS -K100M
to avoid stack space overflow, and the GC stats are pretty hideous (see below).
Here are relevant snippets of the code. Complete, working (GHC 7.4.1) code can be found at http://hpaste.org/68527.
-- Lone algebraic data type holding the simulation configuration.
data SimConfig = SimConfig {
numDimensions :: !Int -- strict
, numWalkers :: !Int -- strict
, simArray :: IntMap [Double] -- strict spine
, logP :: Seq Double -- strict spine
, logL :: Seq Double -- strict spine
, pairStream :: [(Int, Int)] -- lazy (infinite) list of random vals
, doubleStream :: [Double] -- lazy (infinite) list of random vals
} deriving Show
-- The transition kernel for the simulation.
simKernel :: State SimConfig ()
simKernel = do
config <- get
let arr = simArray config
let n = numWalkers config
let d = numDimensions config
let rstm0 = pairStream config
let rstm1 = doubleStream config
let lp = logP config
let ll = logL config
let (a, b) = head rstm0 -- uses random stream
let z0 = head . map affineTransform $ take 1 rstm1 -- uses random stream
where affineTransform a = 0.5 * (a + 1) ^ 2
let proposal = zipWith (+) r1 r2
where r1 = map (*z0) $ fromJust (IntMap.lookup a arr)
r2 = map (*(1-z0)) $ fromJust (IntMap.lookup b arr)
let logA = if val > 0 then 0 else val
where val = logP_proposal + logL_proposal - (lp `index` (a - 1)) - (ll `index` (a - 1)) + ((fromIntegral n - 1) * log z0)
logP_proposal = logPrior proposal
logL_proposal = logLikelihood proposal
let cVal = (rstm1 !! 1) <= exp logA -- uses random stream
let newConfig = SimConfig { simArray = if cVal
then IntMap.update (\_ -> Just proposal) a arr
else arr
, numWalkers = n
, numDimensions = d
, pairStream = drop 1 rstm0
, doubleStream = drop 2 rstm1
, logP = if cVal
then Seq.update (a - 1) (logPrior proposal) lp
else lp
, logL = if cVal
then Seq.update (a - 1) (logLikelihood proposal) ll
else ll
}
put newConfig
main = do
-- (some stuff omitted)
let sim = logL $ (`execState` initConfig) . replicateM 100000 $ simKernel
print sim
In terms of the heap, a profile seems to cue that the System.Random
functions, in addition to (,)
, are memory culprits. I can't include an image directly, but you can see a heap profile here: http://i.imgur.com/5LKxX.png.
I have no idea how to reduce the presence of those things any further. The random variates are generated outside the State
monad (to avoid splitting the generator on every iteration), and I believe the only instance of (,)
inside simKernel
arises when plucking a pair from the lazy list (pairStream
) that is included in the simulation configuration.
The stats, including GC, are as follows:
1,220,911,360 bytes allocated in the heap
787,192,920 bytes copied during GC
186,821,752 bytes maximum residency (10 sample(s))
1,030,400 bytes maximum slop
449 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2159 colls, 0 par 0.80s 0.81s 0.0004s 0.0283s
Gen 1 10 colls, 0 par 0.96s 1.09s 0.1094s 0.4354s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.95s ( 0.97s elapsed)
GC time 1.76s ( 1.91s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.72s ( 2.88s elapsed)
%GC time 64.9% (66.2% elapsed)
Alloc rate 1,278,074,521 bytes per MUT second
Productivity 35.1% of total user, 33.1% of total elapsed
And again, I have to bump up the maximum stack size in order to even run the simulation. I know there must be a big thunk building up somewhere.. but I can't figure out where?
How can I improve the heap/stack allocation and GC in a problem like this? How can I identify where a thunk may be building up? Is the use of the State
monad here misguided?
--
UPDATE:
I neglected to look over the output of the profiler when compiling with -fprof-auto
. Here is the head of that output:
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 58 0 0.0 0.0 100.0 100.0
main Main 117 0 0.0 0.0 100.0 100.0
main.randomList Main 147 1 62.0 55.5 62.0 55.5
main.arr Main 142 1 0.0 0.0 0.0 0.0
streamToAssocList Main 143 1 0.0 0.0 0.0 0.0
streamToAssocList.go Main 146 5 0.0 0.0 0.0 0.0
main.pairList Main 137 1 0.0 0.0 9.5 16.5
consPairStream Main 138 1 0.7 0.9 9.5 16.5
consPairStream.ys Main 140 1 4.3 7.8 4.3 7.8
consPairStream.xs Main 139 1 4.5 7.8 4.5 7.8
main.initConfig Main 122 1 0.0 0.0 0.0 0.0
logLikelihood Main 163 0 0.0 0.0 0.0 0.0
logPrior Main 161 5 0.0 0.0 0.0 0.0
main.sim Main 118 1 1.0 2.2 28.6 28.1
simKernel Main 120 0 4.8 5.1 27.6 25.8
I'm not sure how to interpret this exactly, but the lazy stream of random doubles, randomList
, makes me wince. I have no idea how that could be improved.
I've updated the hpaste with a working example. It looks like the culprits are:
SimConfig
fields: simArray
, logP
and logL
data SimConfig = SimConfig { numDimensions :: !Int -- strict , numWalkers :: !Int -- strict , simArray :: !(IntMap [Double]) -- strict spine , logP :: !(Seq Double) -- strict spine , logL :: !(Seq Double) -- strict spine , pairStream :: [(Int, Int)] -- lazy , doubleStream :: [Double] -- lazy } deriving Show
newConfig
was never evaluated in the simKernel
loop due to State
being lazy. Another alternative would be to use the strict State
monad instead.
put $! newConfig
execState ... replicateM
also builds thunks. I originally replaced this with a foldl'
and moved the execState
into the fold, but I would think swapping in replicateM_
is equivalent and easier to read:
let sim = logL $ execState (replicateM_ epochs simKernel) initConfig
-- sim = logL $ foldl' (const . execState simKernel) initConfig [1..epochs]
And a few calls to mapM .. replicate
have been replaced with replicateM
. Particularly noteworthy in consPairList
where it reduces memory usage quite a bit. There is still room for improvement but the lowest hanging fruit involves unsafeInterleaveST... so I stopped.
I have no idea if the output results are what you want:
fromList [-4.287033457733427,-1.8000404912760795,-5.581988678626085,-0.9362372340483293,-5.267791907985331]
But here are the stats:
268,004,448 bytes allocated in the heap 70,753,952 bytes copied during GC 16,014,224 bytes maximum residency (7 sample(s)) 1,372,456 bytes maximum slop 40 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 490 colls, 0 par 0.05s 0.05s 0.0001s 0.0012s Gen 1 7 colls, 0 par 0.04s 0.05s 0.0076s 0.0209s INIT time 0.00s ( 0.00s elapsed) MUT time 0.12s ( 0.12s elapsed) GC time 0.09s ( 0.10s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.21s ( 0.22s elapsed) %GC time 42.2% (45.1% elapsed) Alloc rate 2,241,514,569 bytes per MUT second Productivity 57.8% of total user, 53.7% of total elapsed
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With