Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Lazy tree with a space leak

I'm writing a program trying to implement a toy XML processor. Right now the program is supposed to read a stream of events (think SAX) describing the structure of a document and to build lazily the corresponding tree.

The events are defined by the following datatype:

data Event = Open String
           | Close

A possible input would then be:

[Open "a", Open "b", Close, Open "c", Close, Close]

that would correspond to the tree:

  a
 / \
b   c

I would like to generate the tree in a lazy way, so that it does not need to be present in memory in full form at any time. My current implementation, however, seems to have a space leak causing all the nodes to be retained even when they are no longer needed. Here is the code:

data Event = Open String
           | Close

data Tree a = Tree a (Trees a)

type Trees a = [Tree a]

data Node = Node String


trees [] = []
trees (Open x : es) =
    let (children, rest) = splitStream es
    in  (Tree (Node x) (trees children)) : (trees rest)

splitStream es = scan 1 es

scan depth (s@(Open {}) : ss) =
    let (b, a) = scan (depth+1) ss
    in  (s:b, a)
scan depth (s@Close : ss) =
    case depth of
      1 -> ([], ss)
      x -> let (b, a) = scan (depth-1) ss
           in  (s:b, a)


getChildren = concatMap loop
  where
    loop (Tree _ cs) = cs


main = print .
       length .
       getChildren .
       trees $
       [Open "a"] ++ (concat . replicate 1000000 $ [Open "b", Close]) ++ [Close]

The function trees converts the list of events into a list of Tree Node. getChildren collects all the children nodes (labeled "b") of the root ("a"). These are then counted and the resulting number is printed.

The compiled program, built with GHC 7.0.4 (-O2), keeps increasing its memory usage up to the point when it prints the node count. I was expecting, on the other hand, an almost constant memory usage.

Looking at the "-hd" heap profile, it is clear that most of the memory is taken by the list constructor (:). It seems like one of the lists produced by scan or by trees is retained in full. I don't understand why, however, as length . getChildren should get rid of child nodes as soon as they are traversed.

Is there a way to fix such space leak?

like image 370
akborder Avatar asked Jul 09 '11 23:07

akborder


2 Answers

I suspect that trees is the evil guy. As John L said this is probably an instance of the Wadler Space Leak in which the compiler is unable to apply the optimization that prevents the leak. The problem is that you use a lazy pattern matching (the let expression) to deconstruct the pair and perform pattern matching via the application of trees on one of the components of the tuple. I had a quite similar problem once http://comments.gmane.org/gmane.comp.lang.haskell.glasgow.user/19129. This thread also provides a more detailed explanation. To prevent the space leak you can simply use a case expression to deconstruct the tuple as follows.

trees [] = []
trees (Open x : es) =
  case splitStream es of
       (children, rest) -> Tree (Node x) (trees children) : trees rest

With this implementation the maximum residency drops from 38MB to 28KB.

But note that this new implementation of trees is more strict than the original one as it demands the application of splitStream. Therefore, in some cases this transformation might even cause a space leak. To regain a less strict implementation you might use a similar trick as the lines function in Data.List which causes a similar problem http://hackage.haskell.org/packages/archive/base/latest/doc/html/src/Data-List.html#lines. In this case trees would look as follows.

trees [] = []
trees (Open x : es) =
  context (case splitStream es of
                (children, rest) -> (trees children, trees rest))
 where
   context ~(children', rest') = Tree (Node x) children' : rest'

If we desugar the lazy pattern matching we get the following implementation. Here the compiler is able to detect the selector to the tuple component as we do not perform pattern matching on one of the components.

trees [] = []
trees (Open x : es) = Tree (Node x) children' : rest'
 where
  (children', rest') =
     case splitStream es of
          (children, rest) -> (trees children, trees rest)

Does anybody know whether this transformation always does the trick?

like image 167
Jan Christiansen Avatar answered Nov 25 '22 13:11

Jan Christiansen


I strongly suspect this is an example of the "Wadler space leak" bug. Unfortunately I don't know how to solve it, but I did find a few things that mitigate the effects somewhat:

1) Change getChildren to

getChildren' = ($ []) . foldl (\ xsf (Tree _ cs) -> xsf . (cs ++)) id

This is a small, but noticeable, improvement.

2) In this example trees always outputs a single-element list. If this is always true for your data, explicitly dropping the rest of the list fixes the space leak:

main = print .
   length .
   getChildren .
   (:[]) .
   head .
   trees
like image 30
John L Avatar answered Nov 25 '22 13:11

John L