memory profiling changes memory usage (for the better)

This is a follow-up to this question. I managed to get the profiling to work, and the problem really does seem to be lazy evaluation.

The data structure I'm using is a Map Int (Map Int Text), where Text is from Data.Text. The problem is that the function which builds this map creates a huge thunk. Working on an input text of about 3 MB, the program needs more than 250 MB of memory.
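To illustrate the kind of thunk buildup I mean, here is a minimal sketch (not my actual build code; lazyCount and strictCount are made-up examples). A lazy Data.Map happily stores unevaluated chains of (+) as values, while insertWith' (from the containers version of the time; newer versions moved this to Data.Map.Strict) forces each value as it goes:

import Data.List (foldl')
import qualified Data.Map as M

-- The lazy insertWith leaves an unevaluated chain of (+) thunks as each
-- value; insertWith' forces the combined value, keeping the payload small.
lazyCount, strictCount :: [Int] -> M.Map Int Int
lazyCount   = foldl' (\m x -> M.insertWith  (+) (x `mod` 10) 1 m) M.empty
strictCount = foldl' (\m x -> M.insertWith' (+) (x `mod` 10) 1 m) M.empty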

Now to the real purpose of this question:

To get the number of characters in this data structure, I use the following function:

type TextResource = M.Map Int (M.Map Int T.Text)

totalSize :: TextResource -> Int
totalSize = M.fold ((+) . M.fold ((+) . T.length) 0) 0

Not beautiful, but it gets the job done. I use this function in main right after the TextResource is created. The interesting thing is that when I profile the program with the RTS options -hr or -hc, memory usage goes down to 70 or 50 MB after a while, which would be totally fine.

Unfortunately, this only works when using the profiling options together with the totalSize function; without them, it's back to 250 MB.
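For what it's worth, a strict left fold computes the sums as it goes instead of building up (+) thunks; a minimal sketch, assuming a containers version that provides M.foldl':

totalSize' :: TextResource -> Int
totalSize' = M.foldl' (\acc inner -> acc + M.foldl' (\a t -> a + T.length t) 0 inner) 0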

I uploaded the program (< 70 lines) together with a test file and a cabal file, so that you can try it yourself: Link

The test.xml is a generated XML file, which should be put into the executable's directory. To build, cabal configure --enable-executable-profiling followed by cabal build should be enough (if you have the profiling versions of the required libraries installed).

You can see the change when running the program once with +RTS -hc and once without.

It'd be really great if someone could run the program, since I'm really stuck here. I already tried putting deepseq in at several places, but nothing works (well, besides using the profiling options).

Edit:

Profiling does show, however, that only ~20MB of the heap is used, so as in my comment, I blame GHC for not freeing as much of the GC nursery memory as you seem to want.

Thanks, that pointed me in the right direction. As it turns out, you can tell GHC to perform a garbage collection (performGC), which works perfectly well after deepseqing the map. Even though I guess the use of performGC is not generally recommended, it seems to be the right tool for the job here.

Edit2: This is how I changed the main function (+ deepseqing the return of buildTextFile):

main = do tf <- buildTextFile "test.xml"  -- buildTextFile now deepseqs its result
          performGC                       -- major GC: reclaim the build-time garbage
          putStrLn . show . text 1 1000 $ tf
          getLine
          putStrLn . show . text 100 1000 $ tf
          return ()
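The deepseqing itself happens inside buildTextFile before the map is returned. A minimal sketch of the idea, assuming NFData instances for Map and Text (provided by the deepseq and text packages); buildMap here is a hypothetical stand-in for the real parsing code:

import Control.DeepSeq (($!!))
import qualified Data.Map as M
import qualified Data.Text as T

buildTextFile' :: FilePath -> IO (M.Map Int (M.Map Int T.Text))
buildTextFile' path = do
    contents <- readFile path
    -- ($!!) fully evaluates the map before returning it, so no thunks
    -- survive into main and performGC can reclaim the build-time garbage.
    return $!! buildMap contents
  where
    buildMap s = M.singleton 1 (M.singleton 1 (T.pack s))  -- hypothetical stand-in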
asked Jul 29 '11 by bzn


1 Answer

The problem is, that the function which builds this map creates a huge thunk.

No. Based on heap profiling, I don't believe the space use is thunks. Also, I've replaced Data.Map with strict HashMaps and forced the map (to avoid creating large thunks), with the same result.
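For reference, the strict-HashMap variant looked roughly like this (a sketch along the lines of what I tried, using unordered-containers' Data.HashMap.Strict; not the exact code):

import qualified Data.HashMap.Strict as HM
import qualified Data.Text as T

type TextResourceHM = HM.HashMap Int (HM.HashMap Int T.Text)

-- Data.HashMap.Strict forces values on insert, and foldl' keeps the
-- accumulator evaluated, so neither building the map nor summing the
-- lengths can accumulate thunks.
totalSizeHM :: TextResourceHM -> Int
totalSizeHM =
    HM.foldl' (\acc inner -> acc + HM.foldl' (\a t -> a + T.length t) 0 inner) 0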

when I profile the program by using the RTS option -hr or -hc the memory usage goes down to 70 or 50 MB after a while

I can't reproduce this. With -hr, -hy, or -hc the process retains a 140 MB heap. Profiling does show, however, that only ~20 MB of the heap is used, so as in my comment, I blame GHC for not freeing as much of the GC nursery memory as you seem to want.

The -hy profile

As for the high memory use during computation, the above -hy profile shows that most of the memory is due to the String type and the HaXml library's Posn type. I'll reiterate my suggestion to look for a ByteString- or Text-based XML library that is more resource conscious (xml-enumerator?).
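As a rough illustration of what a Text-based parse can look like, here is a sketch using xml-conduit (the successor of xml-enumerator); contentLength is a made-up helper and the exact Text.XML API details should be treated as an assumption, not a drop-in replacement:

import qualified Data.Text as T
import qualified Text.XML as X

-- Count the characters of all text content in the parsed document.
contentLength :: X.Node -> Int
contentLength (X.NodeContent t) = T.length t
contentLength (X.NodeElement e) = sum (map contentLength (X.elementNodes e))
contentLength _                 = 0

main :: IO ()
main = do
    doc <- X.readFile X.def "test.xml"   -- parses into strict Text, not String
    print (contentLength (X.NodeElement (X.documentRoot doc)))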

answered Oct 15 '22 by Thomas M. DuBuisson