I am currently learning Haskell and I am curious about the following:
If I add an element to a list in Haskell, Haskell returns a (completely?) new list, and doesn't manipulate the original one.
Now let's say I have a list of a million elements and I append one element at the end. Does Haskell "copy" the whole list (1 million elements) and add the element to that copy? Or is there a neat "trick" going on behind the scenes to avoid copying the whole list?
And if there isn't a "trick", is the process of copying large lists not as expensive as I think it is?
This is a surprisingly complex question, because of two features of Haskell and GHC:
List fusion means that in some situations, GHC can rewrite list processing code into a loop that doesn't allocate list cells. So depending on the context where it is used, the same code could incur no additional cost.
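For instance, under -O GHC's fusion rules can often compile a pipeline like the following into a single accumulating loop (a sketch; sumSquares is just an illustrative name, and whether fusion fires depends on optimization settings):
sumSquares :: Int -> Int
-- With optimizations on, GHC can fuse the enumeration, the map, and
-- the sum into one loop that never allocates the intermediate lists.
sumSquares n = sum (map (^ 2) [1 .. n])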
Lazy evaluation means that if the result of an operation is not consumed, then you don't pay the cost of computing it. So for example, this is cheap, because you only have to construct the first ten elements of the list:
example = take 10 ([1..1000000] ++ [1000001])
In fact, in that code the take 10 can fuse with the list append, so it's the same as just [1..10].
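Laziness even lets the first list be infinite, as long as you only demand a prefix. In GHCi:
ghci> take 10 ([1..] ++ [0])
[1,2,3,4,5,6,7,8,9,10]
The [0] at the end is never reached; only the ten demanded cells are ever constructed.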
But let's just assume that we're consuming all of the elements of all of the lists that we make, and that the compiler isn't fusing our list operations. Now to your questions:
If I add an element to a list in Haskell, Haskell returns a (completely?) new list, and doesn't manipulate the original one. Now let's say I have a list of a million elements and I append one element at the end. Does Haskell "copy" the whole list (1 million elements) and add the element to that copy? Or is there a neat "trick" going on behind the scenes to avoid copying the whole list?
There exist tricks to avoid copying the whole list, but by appending to its end you defeat them. The thing to understand is that functional data structures are normally designed so that operations that "modify" them will exploit structure-sharing to reuse as much of the old structure as possible. So for example, appending two lists can be defined like this:
(++) :: [a] -> [a] -> [a]
[] ++ ys = ys                  -- base case: the second list is reused as-is
(x:xs) ++ ys = x : xs ++ ys    -- each cell of the first list is rebuilt
Looking at this definition, you can tell that the list ys will be reused in the result. So if we have xs = [1..3], ys = [4..5] and xs ++ ys, all fully evaluated and retained in memory at once, it will look something like this memory-wise:
+---+---+ +---+---+ +---+---+
xs = | 1 | -----> | 2 | -----> | 3 | -----> []
+---+---+ +---+---+ +---+---+
+---+---+ +---+---+
ys = | 4 | -----> | 5 | -----> []
+---+---+ +---+---+
^
|
+------------------------------------+
|
+---+---+ +---+---+ +---+---+ |
xs ++ ys = | 1 | -----> | 2 | -----> | 3 | -------+
+---+---+ +---+---+ +---+---+
That is the long way of saying this: if you do xs ++ ys, and it doesn't fuse, and you consume the whole list, then that will create a copy of xs but reuse the memory for ys.
But now let's look again at this bit of your question:
Now let's say I have a list of a million elements and I append one element at the end. Does Haskell "copy" the whole list (1 million elements) and add the element to that copy?
That would be something like [1..1000000] ++ [1000001], and yes, it would copy the whole million elements. But on the other hand, [0] ++ [1..1000000] would only copy the [0]. The rule of thumb is this: prepending an element to a list is O(1), but appending one at the end is O(n), because every cell in front of it must be rebuilt.
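To see the asymptotic difference, compare building a list by repeated appends with building it by prepends and a final reverse (a sketch; buildByAppend and buildByPrepend are illustrative names):
-- O(n^2): every (++) re-copies the entire accumulator built so far.
buildByAppend :: Int -> [Int]
buildByAppend n = foldl (\acc x -> acc ++ [x]) [] [1 .. n]

-- O(n): each (:) is constant-time; one final reverse restores order.
buildByPrepend :: Int -> [Int]
buildByPrepend n = reverse (foldl (\acc x -> x : acc) [] [1 .. n])
Both produce [1 .. n], but the first does quadratic work.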
The general solutions to this sort of problem are:
Data.Sequence (cheap additions at both ends)
Data.Set (ordered sets with O(log n) insertion, deletion, and lookup)
Data.Vector (contiguous arrays with O(1) indexing)
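For example, Data.Sequence makes both ends cheap. A minimal sketch using the containers package's (<|) (prepend) and (|>) (append) operators, where withBothEnds is just an illustrative name:
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (<|), (|>))

-- Both additions below are O(1); with plain lists the append alone
-- would cost O(n).
withBothEnds :: Seq Int
withBothEnds = 0 <| (Seq.fromList [1 .. 5] |> 6)
-- fromList [0,1,2,3,4,5,6]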
It depends on the data structure you're using. Ordinary Haskell lists are analogous to a typical singly linked list implementation in C or C++. With this structure, appends and (worst-case) indexing are O(n), while prepends are O(1). If you append frequently as the list grows, the total cost is effectively O(n^2). For large lists this is a problem, and it holds regardless of language: Haskell, C, C++, Python, Java, C#, even assembly.
However, if you were to use a structure like Data.Sequence.Seq, it uses the proper structure internally (a finger tree) to provide O(1) prepends and appends, at the cost of somewhat higher memory use. All data structures have tradeoffs, though; it's up to you which one you want to use.
Alternatively, you can use Data.Vector.Vector or Data.Array.Array, which both provide fixed-length, contiguous arrays. Appending and prepending are expensive there, because the entire array has to be copied to a new location in RAM. Indexing is O(1), though, and mapping or folding over one of these structures is much faster, because chunks of the array fit into your CPU cache at a time, whereas the elements of linked lists and sequences are scattered all over your RAM.
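As a rough sketch of those tradeoffs (assuming the vector package; vec, middle, and appended are illustrative names):
import qualified Data.Vector as V

vec :: V.Vector Int
vec = V.fromList [1 .. 1000000]

-- O(1): contiguous memory makes random access cheap.
middle :: Int
middle = vec V.! 500000

-- O(n): snoc copies all million elements into a fresh array.
appended :: V.Vector Int
appended = V.snoc vec 1000001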
Does Haskell "copy" the whole list (1 million elements) and adds the element to that copy?
Not necessarily. The compiler can determine whether it's safe simply to change the last cell's next pointer to point at the new value instead of the empty list; if that's unsafe, it may be necessary to copy the entire list. These problems are inherent to the data structure, not the language. In general, I would say that Haskell's lists are better than C linked lists, because the compiler is more capable of analyzing when this is safe than a programmer is, and C compilers won't do this sort of analysis; they just do exactly as they're told.