Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How the Haskell garbage collector efficiently collects trees

This code from the answer to this question copied below quite nicely takes only O(n) space to do a depth first traversal of a tree of depth n which contains O(2^n) nodes. This is very good, the garbage collector seems to be doing a good job of cleaning up the already processed tree.

But my question I have is, how? Unlike a list, where once we process the first element we can completely forget it, we can't scrap the root node after processing the first leaf node. We have to wait until the left half the tree is processed (because eventually we'll have to traverse down the right from the root). Also, as the root node points to the nodes below it, and so on, all the way down to the leaves, which would seem to imply that we wouldn't be able to collect any of the first half of a tree until we start on the second half (as all those nodes will still have references to them starting from the still live root node). This fortunately is not the case, but could someone explain how?

import Data.List (foldl')

data Tree = Tree Int Tree Tree

tree n = Tree n (tree (2 * n)) (tree (2 * n + 1))
treeOne = tree 1

depthNTree n t = go n t [] where
  go 0 (Tree x _ _) = (x:)
  go n (Tree _ l r) = go (n - 1) l . go (n - 1) r

main = do
  x <- getLine
  print . foldl' (+) 0 . filter (\x -> x `rem` 5 == 0) $ depthNTree (read x) treeOne
like image 899
Clinton Avatar asked Dec 23 '15 10:12

Clinton


1 Answers

Actually you don't hold on to the root while you descend the left subtree.

go n (Tree _ l r) = go (n - 1) l . go (n - 1) r

So the root is turned two thunks, composed together. One holds a reference to the left subtree, the other holds a reference to the right subtree. The root node itself is now garbage.

The left and right subtrees themselves are just thunks, because the tree is produces lazily, so they aren't consuming much space yet.

We're only evaluating go n (Tree _ l r) because we're evaluating depthNTree n t, which is go n t []. So we're immediately forcing the two composed go calls we just turned the root into:

(go (n - 1) l . go (n - 1) r) []
= (go (n - 1) l) ((go (n - 1) r) [])

And because this is lazily evaluated, we do the outermost call first, leaving ((go (n - 1) r) []) as a thunk (and so not generating any more of r).

Recursing into go will force l, so we do generate more of that. But then we do the same thing again one level down; again that tree node becomes garbage immediately, we generate two thunks holding the left and right sub sub trees, and then we force only the left one.

After n calls we'll be evaluating go 0 (Tree x _ _) = (x:). We've generated n pairs of thunks, and forced the n left ones, leaving the right ones in memory; because the right sub-trees are unevaluated thunks they're constant space each, and there are only n of them, so only O(n) space total. And all the tree nodes leading to this path are now unreferenced.

We actually have the outermost list constructor (and the first element of the list). Forcing more of the list will explore those right sub-tree thunks further down the composition chain being built up, but there will never be more than n of them.


Technically you have bound a reference to tree 1 in the globally scoped treeOne, so actually you could retain a reference to every node you ever produce, so you're relying on GHC noticing that treeOne is only ever used once and shouldn't be retained.

like image 116
Ben Avatar answered Oct 11 '22 22:10

Ben