A few years ago, during a C# course I learned to write a binary tree that looked more or less like this:
data Tree a = Branch a (Tree a) (Tree a) | Leaf
I saw the benefit of it, it had its values on the branches, which allowed for quick and easy lookup and insertion of values, because it would encounter a value on the root of each branch all the way down until it hit a leaf, that held no value.
Ever since I started learning Haskell, however; I've seen numerous examples of trees that are defined like this:
data Tree a = Branch (Tree a) (Tree a) | Leaf a
That definition puzzles me. I can't see the usefulness of having data on the elements that don't branch, because it would end up leading to a tree that looks like this:
Which to me, seems like a poorly designed alternative to a List. It also makes me question the lookup time of it, since it can't asses which branch to go down to find the value it's looking for; but rather needs to go through every node to find what it's looking for.
So, can anyone shed some light on why the second version (value on leaves) is so much more prevalent in Haskell than the first version?
I think this depends on what you're trying to model and how you're trying to model it.
A tree where the internal nodes store values and the leaves are just leaves is essentially a standard binary tree (tree each leaf as NULL
and you basically have an imperative-style binary tree). If the values are stored in sorted order, you now have a binary search tree. There are many specific advantages to storing data this way, most of which transfer directly over from imperative settings.
Trees where the leaves store the data and the internal nodes are just for structure do have their advantages. For example, red/black trees support two powerful operations called split
and join
that have advantages in some circumstances. split
takes as input a key, then destructively modifies the tree to produce two trees, one of which contains all keys less than the specified input key and one containing the remaining keys. join
is, in a sense, the opposite: it takes in two trees where one tree's values are all less than the other tree's values, then fuses them together into a single tree. These operations are particularly difficult to implement on most red/black trees, but are much simpler if all the data is stored in the leaves only rather than in the internal nodes. This paper detailing an imperative implementation of red/black trees mentions that some older implementations of red/black trees used this approach for this very reason.
As another potential advantage of storing keys in the leaves, suppose that you want to implement the concatenate operation, which joins two lists together. If you don't have data in the leaves, this is as simple as
concat first second = Branch first second
This works because no data is stored in those nodes. If the data is stored in the leaves, you need to somehow move a key from one of the leaves up to the new concatenation node, which takes more time and is trickier to work with.
Finally, in some cases, you might want to store the data in the leaves because the leaves are fundamentally different from internal nodes. Consider a parse tree, for example, where the leaves store specific terminals from the parse and the internal nodes store all the nonterminals in the production. In this case, there really are two different types of nodes, so it doesn't make sense to store arbitrary data in the internal nodes.
Hope this helps!
You described a tree with data at the leaves as "a poorly designed alternative to a List."
I agree that this could be used as an alternative to a list, but it's not necessarily poorly designed! Consider the data type
data Tree t = Leaf t | Branch (Tree t) (Tree t)
You can define cons
and snoc
(append to end of list) operations -
cons :: t -> Tree t -> Tree t
cons t (Leaf s) = Branch (Leaf t) (Leaf s)
cons t (Branch l r) = Branch (cons t l) r
snoc :: Tree t -> t -> Tree t
snoc (Leaf s) t = Branch (Leaf s) (Leaf t)
snoc (Branch l r) t = Branch l (snoc r t)
These run (for roughly balanced lists) in O(log n) time where n is the length of the list. This contrasts with the standard linked list, which has O(1) cons
and O(n) snoc
operations. You can also define a constant-time append
(as in templatetypedef's answer)
append :: Tree t -> Tree t -> Tree t
append l r = Branch l r
which is O(1) for two lists of any size, whereas the standard list is O(n) where n is the length of the left argument.
In practice you would want to define slightly smarter versions of these functions which attempt to keep the tree balanced. To do this it is often useful to have some additional information at the branches, which could be done by having multiple kinds of branch (as in a red-black tree which has "red" and "black" nodes) or explicitly include additional data at the branches, as in
data Tree b a = Leaf a | Branch b (Tree b a) (Tree b a)
For example, you can support an O(1) size
operation by storing the total number of elements in both subtrees in the nodes. All of your operations on the tree become slightly more complicated since you need to correctly persist the information about subtree sizes -- in effect the work of computing the size of the tree is amortized over all the operations that construct the tree (and cleverly persisted, so that minimal work is done whenever you need to reconstruct a size later).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With