I'm writing a program trying to implement a toy XML processor. Right now the program is supposed to read a stream of events (think SAX) describing the structure of a document and to build lazily the corresponding tree.
The events are defined by the following datatype:
data Event = Open String
| Close
A possible input would then be:
[Open "a", Open "b", Close, Open "c", Close, Close]
that would correspond to the tree:
a
/ \
b c
I would like to generate the tree in a lazy way, so that it does not need to be present in memory in full form at any time. My current implementation, however, seems to have a space leak causing all the nodes to be retained even when they are no longer needed. Here is the code:
data Event = Open String
| Close
data Tree a = Tree a (Trees a)
type Trees a = [Tree a]
data Node = Node String
trees [] = []
trees (Open x : es) =
let (children, rest) = splitStream es
in (Tree (Node x) (trees children)) : (trees rest)
splitStream es = scan 1 es
scan depth (s@(Open {}) : ss) =
let (b, a) = scan (depth+1) ss
in (s:b, a)
scan depth (s@Close : ss) =
case depth of
1 -> ([], ss)
x -> let (b, a) = scan (depth-1) ss
in (s:b, a)
getChildren = concatMap loop
where
loop (Tree _ cs) = cs
main = print .
length .
getChildren .
trees $
[Open "a"] ++ (concat . replicate 1000000 $ [Open "b", Close]) ++ [Close]
The function trees
converts the list of events into a list of Tree Node
. getChildren
collects all the children nodes (labeled "b") of the root ("a"). These are then counted and the resulting number is printed.
The compiled program, built with GHC 7.0.4 (-O2), keeps increasing its memory usage up to the point when it prints the node count. I was expecting, on the other hand, an almost constant memory usage.
Looking at the "-hd" heap profile, it is clear that most of the memory is taken by the list constructor (:
). It seems like one of the lists produced by scan
or by trees
is retained in full. I don't understand why, however, as length . getChildren
should get rid of child nodes as soon as they are traversed.
Is there a way to fix such space leak?
I suspect that trees
is the evil guy. As John L said this is probably an instance of the Wadler Space Leak in which the compiler is unable to apply the optimization that prevents the leak. The problem is that you use a lazy pattern matching (the let
expression) to deconstruct the pair and perform pattern matching via the application of trees
on one of the components of the tuple. I had a quite similar problem once http://comments.gmane.org/gmane.comp.lang.haskell.glasgow.user/19129. This thread also provides a more detailed explanation. To prevent the space leak you can simply use a case
expression to deconstruct the tuple as follows.
trees [] = []
trees (Open x : es) =
case splitStream es of
(children, rest) -> Tree (Node x) (trees children) : trees rest
With this implementation the maximum residency drops from 38MB to 28KB.
But note that this new implementation of trees
is more strict than the original one as it demands the application of splitStream
. Therefore, in some cases this transformation might even cause a space leak. To regain a less strict implementation you might use a similar trick as the lines
function in Data.List which causes a similar problem http://hackage.haskell.org/packages/archive/base/latest/doc/html/src/Data-List.html#lines. In this case trees
would look as follows.
trees [] = []
trees (Open x : es) =
context (case splitStream es of
(children, rest) -> (trees children, trees rest))
where
context ~(children', rest') = Tree (Node x) children' : rest'
If we desugar the lazy pattern matching we get the following implementation. Here the compiler is able to detect the selector to the tuple component as we do not perform pattern matching on one of the components.
trees [] = []
trees (Open x : es) = Tree (Node x) children' : rest'
where
(children', rest') =
case splitStream es of
(children, rest) -> (trees children, trees rest)
Does anybody know whether this transformation always does the trick?
I strongly suspect this is an example of the "Wadler space leak" bug. Unfortunately I don't know how to solve it, but I did find a few things that mitigate the effects somewhat:
1) Change getChildren
to
getChildren' = ($ []) . foldl (\ xsf (Tree _ cs) -> xsf . (cs ++)) id
This is a small, but noticeable, improvement.
2) In this example trees
always outputs a single-element list. If this is always true for your data, explicitly dropping the rest of the list fixes the space leak:
main = print .
length .
getChildren .
(:[]) .
head .
trees
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With