Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Excessive garbage collection (and memory use?)

I have identified a tiny portion of a library that seems to contain a memory leak. The code below is as small as I could make it, while still producing the same results as in the real code.

import System.Random
import Control.Monad.State
import Control.Monad.Loops
import Control.DeepSeq
import Data.Int (Int64)
import qualified Data.Vector.Unboxed as U

vecLen = 2048

main = flip evalStateT (mkStdGen 13) $ do
    let k = 64
    cs <- replicateM k transform
    let sizeCs = k*2*7*vecLen*8 -- 64 samples, 2 elts per list, each of len 7*vecLen, 8 bytes per Int64
    (force cs) `seq` lift $ putStr $ "Expected to use ~ " ++ (show ((fromIntegral sizeCs) / 1000000 :: Double)) ++ " MB of memory\n"

transform :: (Monad m, RandomGen g)
           => StateT g m [U.Vector Int64]
transform = do
      e <- liftM ((U.map round) . (uncurry (U.++)) . U.unzip) $ U.replicateM (vecLen `div` 2) sample
      c1 <- U.replicateM (7*vecLen) $ state random
      return [U.concat $ replicate 7 e, c1]

sample :: (RandomGen g, Monad m) => StateT g m (Double, Double)
sample = do 
    let genUVs = liftM2 (,) (state $ randomR (-1,1)) (state $ randomR (-1,1))
        -- memory usage drops and productivity increases to about 58% if I set the guard to "False" (the real code needs a guard here)
        uvGuard (u,v) = u+v >= 2 -- False -- 
    (u,v) <- iterateWhile uvGuard genUVs
    return (u, v)

Removing any more of the code significantly improves performance, either in memory use/GC, time, or both. However, I need the to compute the code above, so the real code can't be any simpler. For example, if I make e and c1 both get values from sample, the code uses 27 MB of memory and spends 9% runtime in GC. If I make both e and c1 use state random, I use about 400MB of memory and only spend 32% of runtime in GC.

The main parameter is vecLen, which I really need around 8192. To expedite profiling, I generated all the results below with vecLen=2048, but the problem is even worse as vecLen increases.

Compiling with

ghc test -rtsopts

I get:

> ./test +RTS -sstderr
Working...
Expected to use ~ 14.680064 MB of memory
Done
   3,961,219,208 bytes allocated in the heap
   2,409,953,720 bytes copied during GC
     383,698,504 bytes maximum residency (17 sample(s))
       3,214,456 bytes maximum slop
             869 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7002 colls,     0 par    1.33s    1.32s     0.0002s    0.0034s
  Gen  1        17 colls,     0 par    1.60s    1.84s     0.1080s    0.5426s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.08s  (  2.12s elapsed)
  GC      time    2.93s  (  3.16s elapsed)
  EXIT    time    0.00s  (  0.03s elapsed)
  Total   time    5.01s  (  5.30s elapsed)

  %GC     time      58.5%  (59.5% elapsed)

  Alloc rate    1,904,312,376 bytes per MUT second

  Productivity  41.5% of total user, 39.2% of total elapsed


real    0m5.306s
user    0m5.008s
sys 0m0.252s

Profiling with -p or -h* doesn't reveal much, at least to me.

The threadscope, however, is interesting: threadscope

It looks to me like I'm blowing the heap, so GC is happening and the heap size is doubling. Indeed, when I run with -H4000M, the threadscope looks slightly more even (less double-the-work,double-the-GC), but I still spend ~60% of the overall runtime doing GC. Compiling with -O2 is even worse, with over 70% of runtime spent in GC.

Questions: 1. Why is the GC running so much? 2. Is my heap usage unexpectedly large? If so, why?

For question 2, I realize heap usage could exceed my "expected" memory usage, even by a lot. But 800MB seems excessive to me. (Is that even the number I should be looking at?)

like image 516
crockeea Avatar asked Oct 03 '13 16:10

crockeea


Video Answer


1 Answers

To being attacking a problem like this, I'll often begin by littering the code with SCC pragmas wherever I feel there may be substantial allocation. In this case, I'm suspicious of e and c1 in transform and genUVs in sample,

...

transform :: (Monad m, RandomGen g)
           => StateT g m [U.Vector Int64]
transform = do
      e <- {-# SCC e #-} liftM (U.map round . uncurry (U.++) . U.unzip) $ U.replicateM (vecLen `div` 2) sample
      c1 <- {-# SCC c1 #-} U.replicateM (7*vecLen) $ state random
      return [U.concat $ replicate 7 e, c1]

sample :: (RandomGen g, Monad m) => StateT g m (Double, Double)
sample = do 
    let genUVs = {-# SCC genUVs #-} liftM2 (,) (state $ randomR (-1,1)) (state $ randomR (-1,1))
        -- memory usage drops and productivity increases to about 58% if I set the guard to "False" (the real code needs a guard here)
        uvGuard (u,v) = u+v >= 2 -- False -- 
    (u,v) <- iterateWhile uvGuard genUVs
    return $ (u, v)

We first look with -hy to see of what type the objects in question are. This reveals a number of different types including Integer, Int32, StdGen, Int, and (,). Using -hc we can determine that almost all of these values are being allocated in c1 of transform. This is confirmed by -hr, which tells us who holds references onto these object (thereby preventing them from being garbage collected). We can further confirm that c1 is the culprit by examining the types of objects it retains with -hrc1 -hy (assuming we've annotated it with {-# SCC c1 #-}).

The fact that c1 is retaining so many objects is an indication that it isn't being evaluated when we'd like it to be. While after it is evaluated c1 is a fairly short vector, before evaluation it requires several thousand random seeds, associated closures, and likely a number of other objects.

Deepseqing c1 brings the GC time from 59% to 23% and reduces memory consumption by an order of magnitude. That is the terminal return in transform turns in to,

deepseq c1 $ return [U.concat $ replicate 7 e, c1]

After this the profile looks fairly reasonable with the largest space user being roughly 10MB of ARR_WORDS allocated in transform (as expected) followed by some tuples, likely from genUVs.

like image 104
bgamari Avatar answered Sep 20 '22 20:09

bgamari