I am currently working my way through the Learn You a Haskell book online, and have come to a chapter where the author explains that some list concatenations can be inefficient. For example,
((((a ++ b) ++ c) ++ d) ++ e) ++ f
is supposedly inefficient. The solution the author comes up with is to use 'difference lists' defined as
newtype DiffList a = DiffList { getDiffList :: [a] -> [a] }

instance Monoid (DiffList a) where
    mempty = DiffList (\xs -> [] ++ xs)
    (DiffList f) `mappend` (DiffList g) = DiffList (\xs -> f (g xs))
I am struggling to understand why DiffList is more computationally efficient than a simple concatenation in some cases. Could someone explain to me in simple terms why the above example is so inefficient, and in what way the DiffList solves this problem?
The problem in
((((a ++ b) ++ c) ++ d) ++ e) ++ f
is the nesting. The applications of (++) are left-nested, and that's bad; right-nesting
a ++ (b ++ (c ++ (d ++ (e ++ f))))
would not be a problem. That is because (++) is defined as
[] ++ ys = ys
(x:xs) ++ ys = x : (xs ++ ys)
so to find which equation to use, the implementation must dive into the expression tree
            (++)
            /  \
         (++)   f
         /  \
      (++)   e
      /  \
   (++)   d
   /  \
(++)   c
/  \
a   b
until it finds out whether the left operand is empty or not. If it's not empty, its head is taken and bubbled to the top, but the tail of the left operand is left untouched, so when the next element of the concatenation is demanded, the same procedure starts again.
When the concatenations are right-nested, the left operand of (++) is always at the top, and checking for emptiness/bubbling up the head are O(1).
But when the concatenations are left-nested, n layers deep, to reach the first element, n nodes of the tree must be traversed, for each element of the result (coming from the first list, n-1 for those coming from the second etc.).
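To make the node counting above concrete, here is a small toy model. It is only a sketch, not how GHC actually evaluates (++): the type Cat and the functions step, drain, leftNested and rightNested are names made up for this illustration. A suspended concatenation is represented as an explicit tree, and step counts how many nodes it visits to produce one element.

```haskell
-- Toy model of an unevaluated tree of (++) applications.
-- All names here are hypothetical and exist only for illustration.
data Cat a
  = Leaf [a]          -- an ordinary list
  | Cat a :++ Cat a   -- a suspended concatenation

-- Take one element off the front and report how many nodes were visited.
step :: Cat a -> (Int, Maybe (a, Cat a))
step (Leaf [])       = (1, Nothing)
step (Leaf (x : xs)) = (1, Just (x, Leaf xs))
step (l :++ r) =
  case step l of
    -- The head bubbles up; the tail of the left operand stays nested.
    (n, Just (x, l')) -> (n + 1, Just (x, l' :++ r))
    -- [] ++ ys = ys: the node collapses to its right operand.
    (n, Nothing)      -> let (m, rest) = step r
                         in (n + m + 1, rest)

-- Consume the whole tree, summing the visit counts of every step.
drain :: Cat a -> (Int, [a])
drain t =
  case step t of
    (n, Nothing)      -> (n, [])
    (n, Just (x, t')) -> let (m, xs) = drain t'
                         in (n + m, x : xs)

-- Build left-nested and right-nested trees (both need a non-empty input).
leftNested, rightNested :: [[a]] -> Cat a
leftNested  = foldl1 (:++) . map Leaf
rightNested = foldr1 (:++) . map Leaf
```

In GHCi, comparing fst (drain (leftNested xss)) with fst (drain (rightNested xss)) for a long list of short lists should show the left-nested count growing much faster, while snd of both results is the same list.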
Let us consider a = "hello" in
hi = ((((a ++ b) ++ c) ++ d) ++ e) ++ f
and we want to evaluate take 5 hi. So first, it must be checked whether
(((a ++ b) ++ c) ++ d) ++ e
is empty. For that, it must be checked whether
((a ++ b) ++ c) ++ d
is empty. For that, it must be checked whether
(a ++ b) ++ c
is empty. For that, it must be checked whether
a ++ b
is empty. For that, it must be checked whether a is empty. Phew. It isn't, so we can bubble up again, assembling
a ++ b                             = 'h':("ello" ++ b)
(a ++ b) ++ c                      = 'h':(("ello" ++ b) ++ c)
((a ++ b) ++ c) ++ d               = 'h':((("ello" ++ b) ++ c) ++ d)
(((a ++ b) ++ c) ++ d) ++ e        = 'h':(((("ello" ++ b) ++ c) ++ d) ++ e)
((((a ++ b) ++ c) ++ d) ++ e) ++ f = 'h':((((("ello" ++ b) ++ c) ++ d) ++ e) ++ f)
and for the 'e', we must repeat, and for the 'l's too...
Drawing a part of the tree, the bubbling up goes like this:
        (++)
        /  \
     (++)   c
     /  \
'h':"ello"   b
becomes first
     (++)
     /  \
   (:)   c
   /  \
 'h'  (++)
      /  \
 "ello"   b
and then
   (:)
   /  \
 'h'  (++)
      /  \
   (++)   c
   /  \
"ello"  b
all the way back to the top. The structure of the tree that finally becomes the right child of the top-level (:) is exactly the same as the structure of the original tree, unless the leftmost list is empty, in which case the
 (++)
 /  \
[]   b
node is collapsed to just b.
So if you have left-nested concatenations of short lists, the concatenation becomes quadratic, because getting the head of the concatenation is an O(nesting-depth) operation. In general, the concatenation of a left-nested
((...((a_d ++ a_{d-1}) ++ a_{d-2}) ...) ++ a_2) ++ a_1
is O(sum [i * length a_i | i <- [1 .. d]]) to evaluate fully.
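As a rough illustration of that bound, the expression inside the O(...) can be computed directly. This is only a sketch of the estimate, not a measurement; leftNestedCost and rightNestedCost are hypothetical names, and the input list is assumed to be in concatenation order, i.e. [a_d, ..., a_2, a_1].

```haskell
-- Estimate from the bound above: sum [i * length a_i | i <- [1 .. d]].
-- The input is in concatenation order [a_d, ..., a_2, a_1], so the first
-- list gets the largest weight d and the last list gets weight 1.
leftNestedCost :: [[a]] -> Int
leftNestedCost xss = sum (zipWith (*) [d, d - 1 .. 1] (map length xss))
  where
    d = length xss

-- For comparison: the element count alone, which is what a right-nested
-- concatenation pays for the elements (ignoring the per-list overhead).
rightNestedCost :: [[a]] -> Int
rightNestedCost = sum . map length
```

For d lists of the same length k, the first estimate is k * d * (d + 1) / 2, which is where the quadratic behaviour for many short lists comes from.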
With difference lists (sans the newtype wrapper for simplicity of exposition), it's not important whether the compositions are left-nested
((((a ++) . (b ++)) . (c ++)) . (d ++)) . (e ++)
or right-nested. Once you have traversed the nesting to reach the (a ++), that (++) is hoisted to the top of the expression tree, so getting at each element of a is again O(1).
In fact, the whole composition is reassociated with difference lists, as soon as you require the first element,
((((a ++) . (b ++)) . (c ++)) . (d ++)) . (e ++) $ f
becomes
((((a ++) . (b ++)) . (c ++)) . (d ++)) $ (e ++) f
(((a ++) . (b ++)) . (c ++)) $ (d ++) ((e ++) f)
((a ++) . (b ++)) $ (c ++) ((d ++) ((e ++) f))
(a ++) $ (b ++) ((c ++) ((d ++) ((e ++) f)))
a ++ (b ++ (c ++ (d ++ (e ++ f))))
and after that, each list is the immediate left operand of the top-level (++) after the preceding list has been consumed.
The important thing in that is that the prepending function (a ++) can start producing its result without inspecting its argument, so that the reassociation from
                 ($)
                 / \
               (.)  f
               / \
             (.)  (e ++)
             / \
           (.)  (d ++)
           / \
         (.)  (c ++)
         / \
    (a ++)  (b ++)
via
            ($)---------
            /           \
          (.)           ($)
          / \           / \
        (.)  (d ++) (e ++) f
        / \
      (.)  (c ++)
      / \
 (a ++)  (b ++)
to
    ($)
    / \
(a ++) ($)
       / \
   (b ++) ($)
          / \
      (c ++) ($)
             / \
         (d ++) ($)
                / \
            (e ++) f
doesn't need to know anything about the composed functions or the final list f, so it's just an O(depth) rewriting. Then the top-level
    ($)
    / \
(a ++) stuff
becomes
 (++)
 /  \
a    stuff
and all elements of a can be obtained in one step. In this example, where we had pure left-nesting, only one rewriting is necessary. If instead of (for example) (d ++) the function in that place had been a left-nested composition, (((g ++) . (h ++)) . (i ++)) . (j ++), the top-level reassociation would leave that untouched and this would be reassociated when it becomes the left operand of the top-level ($) after all previous lists have been consumed.
The total work needed for all reassociations is O(number of lists), so the overall cost for the concatenation is O(number of lists + sum (map length lists)). (That means you can bring bad performance to this too, by inserting a lot of deeply left-nested ([] ++).)
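The unwrapped difference lists used in this explanation can be sketched in a few lines. The names DL, fromList, toList and leftNestedDL are assumptions for this illustration only; leftNestedDL deliberately builds the left-nested composition ((((a ++) . (b ++)) . (c ++)) . ...) discussed above.

```haskell
-- A difference list without the newtype wrapper: a function that prepends
-- its contents to whatever list it is given.
type DL a = [a] -> [a]

-- Turn an ordinary list into the prepending function (xs ++).
fromList :: [a] -> DL a
fromList xs = (xs ++)

-- Supply the final tail to get an ordinary list back.
toList :: DL a -> [a]
toList dl = dl []

-- Build the left-nested composition from a list of lists.
leftNestedDL :: [[a]] -> DL a
leftNestedDL = foldl (\acc xs -> acc . fromList xs) id
```

Because (a ++) can start producing output without inspecting its argument, something like take 5 (toList (leftNestedDL ["hello", undefined])) only ever looks at the first list; the later lists stay unevaluated.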
The
newtype DiffList a = DiffList { getDiffList :: [a] -> [a] }

instance Monoid (DiffList a) where
    mempty = DiffList (\xs -> [] ++ xs)
    (DiffList f) `mappend` (DiffList g) = DiffList (\xs -> f (g xs))
just wraps that so that it is more convenient to handle abstractly.
DiffList (a ++) `mappend` DiffList (b ++) ~> DiffList ((a ++) . (b++))
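One practical note if you want to try the wrapper yourself: since GHC 8.4 (base 4.11), Semigroup is a superclass of Monoid, so the instance as quoted from the book will not compile on its own with a recent compiler. A minimal sketch that should work on current GHC versions follows; the helpers toDiffList and fromDiffList are not part of the code quoted in this thread and are added here only for convenience.

```haskell
newtype DiffList a = DiffList { getDiffList :: [a] -> [a] }

-- The combining operation lives in Semigroup on modern GHC.
instance Semigroup (DiffList a) where
  DiffList f <> DiffList g = DiffList (f . g)

-- mempty prepends nothing; mappend defaults to (<>).
instance Monoid (DiffList a) where
  mempty = DiffList id

-- Convenience helpers (assumed names, not from the quoted code).
toDiffList :: [a] -> DiffList a
toDiffList xs = DiffList (xs ++)

fromDiffList :: DiffList a -> [a]
fromDiffList (DiffList f) = f []
```

With this, fromDiffList (mconcat (map toDiffList lists)) concatenates the lists regardless of how the (<>) applications end up nested.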
Note that it is only efficient for functions that don't need to inspect their argument to start producing output. If arbitrary functions are wrapped in DiffLists, you have no such efficiency guarantees. In particular, appending ((++ a), wrapped or not) can create left-nested trees of (++) when composed right-nested, so you can create the O(n²) concatenation behaviour with that if the DiffList constructor is exposed.
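To see why appending functions are the problematic case, here is a small sketch. appendSuffix and composeSuffixes are hypothetical names for this illustration; the point is only the shape of the resulting expression.

```haskell
-- (++ a) appends a fixed suffix, so it has to walk its argument before it
-- can reach a. Contrast this with (a ++), which never inspects its argument.
appendSuffix :: [a] -> ([a] -> [a])
appendSuffix a = (++ a)

-- Right-nested composition: (++ a1) . ((++ a2) . (... . (++ ad))).
-- Applied to xs this unfolds to (((xs ++ ad) ++ ...) ++ a2) ++ a1,
-- which is exactly the left-nested shape that is slow.
composeSuffixes :: [[a]] -> [a] -> [a]
composeSuffixes = foldr (.) id . map appendSuffix
```

For example, composeSuffixes ["a", "b", "c"] "x" unfolds to (("x" ++ "c") ++ "b") ++ "a".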
It might help to look at the definition of concatenation:
[] ++ ys = ys
(x:xs) ++ ys = x : (xs ++ ys)
As you can see, in order to concatenate two lists you need to go over the left list and create a "copy" of it, just so you can change its end (this is because you can't directly change the end of the old list, due to immutability).
If you do your concatenations in the right associative way, there is no problem. Once inserted, a list will never have to be touched again (notice how ++'s definition never inspects the list on the right) so each list element is only inserted "once" for a total time of O(N).
-- This is O(n)
(a ++ (b ++ (c ++ (d ++ (e ++ f)))))
However, if you do concatenation in a left associative way, then the "current" list will have to be "torn down" and "rebuilt" every time you add another list fragment to the end. Each list element will be iterated over when it's inserted and whenever future fragments are appended as well! It's like the problem you get in C if you naïvely call strcat multiple times in a row.
As for difference lists, the trick is that they kind of keep an explicit "hole" at the end. When you convert a DList back to a normal list you pass it what you want to be in the hole and it will be ready to go. Normal lists, on the other hand, always plug up the hole in the end with [], so if you want to change it (when concatenating) then you need to rip open the list to get to that point.
The definition of difference lists with functions can look intimidating at first, but it's actually pretty simple. You can view them from an object-oriented point of view by thinking of them as opaque objects with a "toList" method that receives the list that you should insert in the hole at the end and returns the DL's internal prefix plus the tail that was provided. It's efficient because you only plug the "hole" at the end, after you are done converting everything.
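The "hole" picture can be written down fairly literally. This is a sketch under the assumption that a DList is just a function from the hole's contents to the finished list; empty, snoc, toList and collect are illustrative names, not a library API.

```haskell
-- A list with an explicit hole at the end: give it the contents of the
-- hole and it returns the finished list.
type DList a = [a] -> [a]

-- Nothing but the hole.
empty :: DList a
empty = \hole -> hole

-- Add one element at the end by putting it in front of the new hole.
-- No existing element is copied or traversed.
snoc :: DList a -> a -> DList a
snoc dl x = \hole -> dl (x : hole)

-- Plug the hole with [] once, when everything has been added.
toList :: DList a -> [a]
toList dl = dl []

-- Example use: collect elements in order by repeatedly adding at the end.
collect :: [a] -> [a]
collect = toList . foldl snoc empty
```

Doing the same with ordinary lists, as in foldl (\acc x -> acc ++ [x]) [], would rebuild the accumulated list on every step, which is the strcat-style problem described above.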