
Haskell: how to detect "lazy memory leaks"

After a few hours of debugging, I realized that a very simple toy example was not efficient due to a missing ! in the expression return $ 1 + x (thanks duplode!... but how come GHC does not optimize that??). I also only realized it because I was comparing it with Python code that was quicker, but I won't always write Python code to benchmark my code...

So here is my question: is there a way to automatically detect these "lazy memory leaks", which slow down a program for no real reason? I'm still pretty bad at optimizing Haskell code, and forgetting a ! is quite likely, even when you're experienced, I guess.

I'm aware of:

  • the +RTS -s option, but I'm not sure how to interpret it: seeing 79 MB of memory for a simple program seems huge to me, for example, but maybe it isn't, since that's what my current program takes... and for bigger programs it seems impossible to detect "lazy leaks" that way anyway, as I have no idea how much memory my program should take.
  • the cabal v2-run --enable-profiling mysatsolvers -- +RTS -p command, but it seems that enabling the profiler disables some GHC optimisations, which makes it hard to use these numbers for a real benchmark. And even then, it's not clear to me how to find leaks from that output anyway.

Could you, for example, explain how I could find the "lazy leaks" in a toy program like this one?

{-# LANGUAGE DerivingVia, FlexibleInstances, ScopedTypeVariables #-}
module Main where

--- It depends on the transformers, containers, and base packages.
--- Optimisation seems to be important or the NoLog case will be way too long.
--- $ ghc -O Main.hs

import qualified Data.Map.Strict as MapStrict
import Data.Functor.Identity

import qualified Control.Monad as CM
import qualified Control.Monad.State.Strict as State
import qualified Data.Time as Time

-- Create a class that allows me to use the function "myTell"
-- that adds a number in the writer (either the LogEntry
-- or StupidLogEntry one)
class Monad m => LogFunctionCalls m where
  myTell :: String -> Int -> m ()

---------- Logging disabled ----------
--- (No logging at all gives the same time so I don't include it here)
newtype NoLog a = NoLog { unNoLog :: a }
  deriving (Functor, Applicative, Monad) via Identity

instance LogFunctionCalls NoLog where
  myTell _ _ = pure ()

---------- Logging with Map ----------
-- When logging, associate a number to each name.
newtype LogEntryMap = LogEntryMap (MapStrict.Map String Int)
  deriving (Eq, Show)

instance LogFunctionCalls (State.State LogEntryMap) where
  myTell namefunction n = State.modify' $
    \(LogEntryMap m) ->
      LogEntryMap $ MapStrict.insertWith (+) namefunction n m

---------- Logging with Int ----------
-- Don't use any Map to avoid inefficiency of Map
newtype LogEntryInt = LogEntryInt Int
  deriving (Eq, Show)

instance LogFunctionCalls (State.State LogEntryInt) where
  myTell namefunction n = State.modify' $
    \(LogEntryInt m) -> LogEntryInt $! m + n

---------- Function to compute ----------
countNumberCalls :: (LogFunctionCalls m) => Int -> m Int
countNumberCalls 0 = return 0
countNumberCalls n = do
  myTell "countNumberCalls" 1
  x <- countNumberCalls $! n - 1
  return $ 1 + x

main :: IO ()
main = do
  let www = 15000000
  putStrLn $ "Let's start!"
  --- Logging disabled
  t0 <- Time.getCurrentTime
  let n = unNoLog $ countNumberCalls www
  putStrLn $ "Logging disabled: " ++ (show n)
  t1 <- Time.getCurrentTime
  print (Time.diffUTCTime t1 t0)
  -- Logging with Map
  let (n, LogEntryMap log) = State.runState (countNumberCalls www) (LogEntryMap MapStrict.empty)
  putStrLn $ "Logging with Map: " ++ (show n)
  putStrLn $ (show $ log)
  t2 <- Time.getCurrentTime
  print (Time.diffUTCTime t2 t1)
  -- Logging with Int
  let (n, LogEntryInt log) = State.runState (countNumberCalls www) (LogEntryInt 0)
  putStrLn $ "Logging with Int: " ++ (show n)
  putStrLn $ (show $ log)
  t3 <- Time.getCurrentTime
  print (Time.diffUTCTime t3 t2)
asked May 07 '20 by tobiasBora



1 Answer

The main method for detecting memory leaks is heap profiling. Specifically, you're looking for unexpected growth in the amount of resident (mostly heap) memory, either the maximum residency in the +RTS -s statistics output, or -- more reliably -- a characteristic "pyramid" shape over time in heap profile output generated with the +RTS -h<x> flags and the hp2ps tool.

If I run your toy program with +RTS -s, I see:

   3,281,896,520 bytes allocated in the heap
   3,383,195,568 bytes copied during GC
     599,346,304 bytes maximum residency (17 sample(s))
       5,706,584 bytes maximum slop
             571 MB total memory in use (0 MB lost due to fragmentation)

The first line can generally be ignored. Haskell programs typically allocate a roughly constant amount of memory per second of runtime, and this allocation rate is either nearly zero (for certain unusual programs) or 0.5-2.0 gigabytes per second. This program ran for 4 seconds and allocated about 3.3 gigabytes, and that's not unusual.

The bytes copied during GC and maximum residency are concerning, though. Assuming you have a program that you expect to run in constant space (i.e., there's no ever-growing data structure whose entire contents are needed), a correctly functioning Haskell program generally won't need to copy much data during garbage collection, and its maximum residency will be a small fraction of the total bytes allocated (e.g., 100 kilobytes rather than half a gigabyte). It also won't grow substantially with the number of "iterations" of whatever it is you're testing.
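To make the distinction concrete, here's a minimal, self-contained example of my own (not taken from your program) showing the classic lazy-accumulator leak. Compiled without optimization, the foldl version typically shows a large, n-proportional maximum residency in +RTS -s (and the "pyramid" shape in a heap profile, discussed below), while the foldl' version runs in constant space. With -O2, GHC's strictness analysis can sometimes rescue the lazy version, which is one reason these leaks are easy to miss:

import Data.List (foldl')

-- Leaky: foldl builds a chain of unevaluated (+) thunks as long as the list,
-- so the maximum residency grows linearly with n.
leakySum :: Int -> Int
leakySum n = foldl (+) 0 [1 .. n]

-- Constant space: foldl' forces the accumulator at every step.
strictSum :: Int -> Int
strictSum n = foldl' (+) 0 [1 .. n]

main :: IO ()
main = print (leakySum 10000000)   -- swap in strictSum and compare with +RTS -s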

You can generate a quick heap profile over time without turning on formal profiling. If you compile with the GHC flag -rtsopts, you can use:

./Toy +RTS -hT 

and then display the result graphically using the hp2ps tool:

hp2ps -c -e8in Toy.hp
evince Toy.ps &

This sort of pyramid pattern is a red flag:

[Heap profile of the toy program: two pyramid-shaped peaks of resident memory over time]

Note the rapid linear increase in heap usage to the tune of hundreds of megabytes per second, followed by a rapid linear collapse. This is the pattern you see when a huge lazy data structure is needlessly built up before the entire computation is forced all at once. You see two pyramids here because both your second and third tests exhibit memory leaks.

As an aside, the x-axis is in "MUT seconds" (seconds the "mutator" is running, which excludes garbage collection), so that's why this is less than the actual 4 second runtime. That's actually another red flag. A Haskell program that's spending half its time garbage collecting probably isn't running correctly.

To get more detail on what's causing this heap pyramid, you'll need to compile with profiling enabled. Profiling may cause a program to run somewhat slower, but it doesn't normally change which optimizations are in place. However, the flag -fprof-auto (and related flags), which automatically inserts cost centers, can cause big performance changes (by interfering with inlining, etc.). Unfortunately, the cabal --enable-profiling flag turns on profiling (compiler flag -prof) together with -fprof-auto-top, which automatically generates cost centers for top-level functions, so for your toy example that substantially changes the behavior of your first test case (increasing the runtime from 0.4 seconds to 5 seconds, even with no +RTS flags). That may be the problem you're seeing with profiling affecting your results. You don't need any cost centers for several additional kinds of heap profiles, so you can add the cabal flag --profiling-detail=none to shut that off; your profiled program will then run a little slower but with generally similar performance to the unprofiled version.
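For example, with your original cabal invocation that would look something like the following (a sketch only; I haven't tested this exact command, and mysatsolvers is the executable name from your question):

cabal v2-run --enable-profiling --profiling-detail=none mysatsolvers -- +RTS -hy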

I don't use Cabal, but compiling with the following (which should be the equivalent of --enable-profiling --profiling-detail=none):

ghc -O2 -rtsopts -prof Toy.hs    # no -fprof-auto... 

I can run your program with profiling by data type:

./Toy +RTS -hy 

If I look at the heap profile graph:

[Heap profile by data type: nearly all of the resident heap is attributed to Int]

this attributes most of the heap to the Int type -- this narrows my problem down to a bunch of unevaluated lazy Int calculations, which might point me in the right direction.

If I'm really having trouble narrowing things down and am feeling like a technical deep-dive, I can also run a heap profile by closure (flag -hd). This tells me that the culprits are Main.sat_s7mQ and Main.sat_s7kP for the two pyramids respectively. This looks very mysterious, but they're the names of functions in the "STG", a low-level intermediate representation of my program generated by the compiler.
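For reference, that's the same profiled Toy binary run with the closure-description profile, followed by hp2ps as before:

./Toy +RTS -hd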

If I recompile with the same flags but add -fforce-recomp -ddump-stg -dsuppress-all:

ghc -O2 -rtsopts -prof -fforce-recomp -ddump-stg -dsuppress-all Toy.hs 

this will dump the STG that contains the definitions of these two functions. (The generated identifiers can differ with small changes to code and/or compiler flags, so it's best to recompile with the STG dumped and then re-profile that executable, to make sure the identifiers match.)

If I search the STG for the first culprit, I find the definition:

sat_s7mQ =
    CCCS \u []
        case ww2_s7mL of {
          I# y_s7mO ->
              case +# [1# y_s7mO] of sat_s7mP {
                __DEFAULT -> I# [sat_s7mP];
              };
        };

Yes, this is all very technical, but this is STG-speak for the expression 1 + y, which would help me zero in on the culprit.

If you don't speak STG, you can try introducing some cost centers. For example, I tried profiling only your second test case with -fprof-auto (Cabal flag --profiling-detail=all-functions). The profile output in Toy.prof isn't that useful for memory leaks because it deals with total allocation instead of active (i.e., resident and not garbage collected) allocations over time, but you can create a heap profile by cost center by running:

./Toy +RTS -hc 

In this case, it attributes everything to a single cost center, namely (315)countNumberCalls. The "315" is the cost center number, which you can look up in the Toy.prof output to find the exact source code lines, if it's not clear from the name. Anyway, this at least helps narrow down the problem to countNumberCalls.

For more complicated functions, you can sometimes narrow down the problem further by manually specifying cost centers, like so:

countNumberCalls :: (LogFunctionCalls m) => Int -> m Int
countNumberCalls 0 = return 0
countNumberCalls n = do
  {-# SCC "mytell_call" #-} myTell "countNumberCalls" 1
  x <- {-# SCC "recursive_call" #-} countNumberCalls $! n - 1
  {-# SCC "return_statment" #-} return $ {-# SCC "one_plus_x" #-} 1 + x

This actually attributes everything to "recursive_call", so it's not that helpful.

It's not wrong, though. You actually have two memory leaks here -- the x <- countNumberCalls $! n - 1 leaks heap because x isn't forced, and the 1 + x leaks stack. You could enable the BangPatterns extension and write:

!x <- countNumberCalls $! n - 1 

and that would actually remove one of the memory leaks, speeding up the second case from 2.5 seconds to 1.0 seconds and dropping the maximum residency from 460 megs to 95 megs (and the bytes copied during GC from 1.5 gigs to 73 kilobytes!). However, a heap profile would show linearly growing stack accounting for pretty much all of that resident memory. Because stack isn't as well tracked as heap, that would be more difficult to track down.
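In the context of your full function, that change looks like this (a sketch; the rest of the module stays exactly as in your question, and BangPatterns needs to be enabled at the top of the file):

{-# LANGUAGE BangPatterns #-}

countNumberCalls :: (LogFunctionCalls m) => Int -> m Int
countNumberCalls 0 = return 0
countNumberCalls n = do
  myTell "countNumberCalls" 1
  !x <- countNumberCalls $! n - 1   -- force the result of the recursive call
  return $ 1 + x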

Some additional notes:

Even though the +RTS -h<x> flags are primarily for heap profiling (and are discussed as "heap profiling" options in the GHC documentation), they can technically report on other uses of resident memory besides heap, including per-thread state, which covers thread state objects and stack. By default, when running a profiled binary (compiled with -prof), the +RTS -h<x> flags do not report on per-thread state including stack, but you can add the -xt flag to include it, as in +RTS -hc -xt. Due to a probable unintentional oversight, on a non-profiled binary, the +RTS -hT flag (the only -h<x> flag available) includes stack even without the -xt flag. Due to a compiler bug, the -hT flag doesn't work on profiled binaries for GHC 8.6.x and earlier, but it does work on GHC 8.8.x; for that version, +RTS -hT includes stack on non-profiled binaries but excludes it on profiled binaries unless you also specify -xt. That's why, in the examples above, "Stack" only shows up when running a heap profile on a non-profiled binary. You can add the -xt flag to see it for all the other heap profiles. Note that this "STACK" is actual stack use, rather than objects on the heap that are somehow affiliated with the stack.
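For example, to include stack in a cost-center profile of the profiled binary built above:

./Toy +RTS -hc -xt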

Black holes are primarily a mechanism for supporting concurrency. When a thread starts evaluating a thunk, it "blackholes" it (i.e., marks it as a black hole), so that if another thread comes along and wants to evaluate the same thunk, it waits on the evaluation instead of trying to re-evaluate it in parallel (which would duplicate the effort of the running thread). It's also used in the non-threaded runtime, partly because it can detect infinite loops (if a thread encounters its own black hole), but also for some more important reasons that I can't remember. For -hT, -hd, and -hy heap profiling, heap objects that have been blackholed like this will be marked as "BLACKHOLE". The limited sampling rate in the profiles above can make it a little unclear, but what's happening in your program is that a large series of Int thunks are being built in a chain, and when the value is finally forced, they are turned into a long chain of BLACKHOLEs, each of which represents a computation that's been initiated and is waiting on the next computation in the chain.

answered Sep 29 '22 by K. A. Buhr