This is actually a follow-up to this question. I managed to get the profiling to work, and the problem really seems to be lazy evaluation.
The data structure I'm using is a Map Int (Map Int Text), where Text is from Data.Text. The problem is that the function which builds this map creates a huge thunk: working on an input text of about 3 MB, the program needs more than 250 MB of memory.
Now to the real purpose of this question: to count the characters in this data structure, I use the following function:
type TextResource = M.Map Int (M.Map Int T.Text)

totalSize :: TextResource -> Int
totalSize = M.fold ((+) . M.fold ((+) . T.length) 0) 0
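For reference, the lazy M.fold builds a chain of (+) thunks over the whole map; a stricter variant (a sketch, assuming containers provides Data.Map.foldl', which it does since 0.4.2) forces the accumulator at every step instead:

```haskell
import qualified Data.Map as M
import qualified Data.Text as T

type TextResource = M.Map Int (M.Map Int T.Text)

-- Strict left folds keep the accumulator evaluated, so no
-- chain of pending (+) applications is built up.
totalSize' :: TextResource -> Int
totalSize' =
  M.foldl' (\acc inner -> acc + M.foldl' (\a t -> a + T.length t) 0 inner) 0
```

This only forces the Int accumulator, of course; it does not force the Text values themselves beyond what T.length already demands.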
Not beautiful, but it gets the job done. I'm using this function in main right after the TextResource is created. The interesting thing is that when I profile the program with the RTS option -hr or -hc, the memory usage goes down to 70 or 50 MB after a while, which would be totally fine. Unfortunately, this only happens when I use both the profiling options and the totalSize function; without them it's back to 250 MB.
I uploaded the program (< 70 lines) together with a test file and a cabal file so that you can try it yourself: Link
The test.xml is a generated XML file, which should be put into the executable's directory.
To build, cabal configure --enable-executable-profiling followed by cabal build should be enough (if you have the profiling versions of the required libraries installed).
You can see the change when running the program once with +RTS -hc and once without.
It would be really great if someone could run the program, since I'm really stuck here. I already tried putting deepseq in at several places, but nothing works (well, besides using the profiling options).
Edit:
"Profiling does show, however, that only ~20 MB of the heap is used, so as in my comment, I blame GHC for not freeing as much of the GC nursery memory as you seem to want."
Thanks, that pointed me in the right direction. As it turns out, you can tell GHC to perform a garbage collection (performGC), which works perfectly well after deepseqing the map. Even though the use of performGC is probably not recommended, it seems to be the right tool for the job here.
Edit2: This is how I changed the main function (+ deepseqing the return of buildTextFile):
main = do
    tf <- buildTextFile "test.xml"
    performGC
    putStrLn . show . text 1 1000 $ tf
    getLine
    putStrLn . show . text 100 1000 $ tf
    return ()
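The deepseq part isn't shown above; a minimal sketch of how the return value of buildTextFile could be forced (forceReturn is a hypothetical helper, assuming the NFData instances for Map and Text that containers and text provide) looks like this:

```haskell
import Control.DeepSeq (NFData, deepseq)
import qualified Data.Map as M
import qualified Data.Text as T

-- Hypothetical helper: run an IO action and fully evaluate its
-- result before returning it, so no thunks survive the call.
forceReturn :: NFData a => IO a -> IO a
forceReturn act = do
  x <- act
  x `deepseq` return x

-- In main this would become:
--   tf <- forceReturn (buildTextFile "test.xml")
--   performGC
```

Forcing the map first matters: performGC can only reclaim the intermediate parse structures if nothing in the still-unevaluated result points at them.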
"The problem is, that the function which builds this map creates a huge thunk."
No. Based on heap profiling, I don't believe the space use is thunks. Also, I've replaced Data.Map with strict HashMaps and forced the map (to avoid creating large thunks), with the same result.
"when I profile the program by using the RTS option -hr or -hc the memory usage goes down to 70 or 50 MB after a while"
I can't reproduce this. With -hr, -hy, or -hc the process retains a 140 MB heap. Profiling does show, however, that only ~20 MB of the heap is used, so as in my comment, I blame GHC for not freeing as much of the GC nursery memory as you seem to want.
As for the high memory use during computation, the above -hy profile shows most of the memory is due to the String type and the HaXml library's Posn type. I'll reiterate my suggestion to look for a ByteString- or Text-based XML library that is more resource-conscious (xml-enumerator?).