For example:

```haskell
main = do
  let ls = [0..10000000]
  print ls
```

This will create the array 'inplace', using O(1) memory.
The following edit causes the program to run out of memory while executing.
```haskell
main = do
  let ls = [0..10000000]
  print ls
  print ls
```
`ls` in this case must be kept in memory to be printed again. It would actually be heaps more memory efficient to recalculate the array again 'inplace' than to try to keep it around. That's an aside though. My real question is "how and when does GHC communicate to the runtime system that `ls` can be destroyed as it is being created, using only O(1) memory?" I understand that liveness analysis can find this information; I'm just wondering where the information is used. Is it the garbage collector that is passed this info? Is it somehow compiled away? (If I look at the compiled Core from GHC then both examples use `eftInt`, so if it's a compiler artifact then it must happen at a deeper level.)
edit: My question was more about finding where this optimization takes place. I thought maybe it was in the GC, which was fed some information from a liveness check during compilation. Judging by the answers so far, I'm probably wrong. This is most likely happening at some level below Core then, so Cmm perhaps?
edit2: Most of the answers here assume that the GC knows that `ls` is no longer referenced in the first example, and that it is referenced again in the second example. I know the basics of GC, and I know that these 'arrays' are really linked lists, etc. My question is exactly HOW the GC knows this. The answer can probably only be (a) it is getting extra information from the compiler, or (b) it doesn't need to know this, because this information is handled 100% by the compiler.
`ls` here is a lazy list, not an array. In practice, it's closer to a stream or generator in another language.

The reason the first code works fine is that it never actually has the whole list in memory. `ls` is defined lazily and then consumed element by element by `print`. As `print` goes along, there are no other references to the beginning of `ls`, so the list cells can be garbage collected immediately.
In theory, GHC could realize that it's more efficient to not store the list in memory between the two prints but instead recompute it. However, this is not always desirable—a lot of code is actually faster if things are only evaluated once—and, more importantly, would make the execution model even more confusing for programmers.
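If you do want to print the list twice without retaining it all, one workaround (a sketch of my own, not something from the question) is to hide the list behind a function so that each use rebuilds it:

```haskell
-- Sketch: rebuild the list on each use instead of sharing one copy.
-- Caveat: with optimizations on, GHC's full-laziness pass may float the
-- list out and share it anyway, so this is not a guaranteed fix.
mkList :: () -> [Integer]
mkList () = [0 .. 10000000]

main :: IO ()
main = do
  print (mkList ())
  print (mkList ())
```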
This explanation is probably a lie, especially because I'm making it up as I go, but that shouldn't be a problem.
The essential mistake you're making is assuming that a value is live if a variable bound to it is in scope in a live expression. This is simply wrong. A value bound to a variable is only kept live by that binding if the variable is actually mentioned in a live expression.
The job of the runtime system is very simple: it has to execute `main`. We can think of this execution as involving a couple of different steps that repeat over and over: evaluate the current expression to weak head normal form to find the next primitive operation, perform that operation, and continue with whatever remains.

So we start with some `main` expression. From the start, the "root set" for GC consists of those names that are used in that `main` expression, not the things that are in scope in that expression. If I write
foo = "Hi!"
main = print "Bye!"
then since `foo` is not mentioned in `main`, it is not in the root set at the beginning, and since it is not even mentioned indirectly by anything mentioned by `main`, it is dead right from the start.
Now suppose we take a more interesting example:
foo = "Hi!"
bar = "Bye!"
main = print foo >> print bar
Now `foo` is mentioned in `main`, so it starts out live. We evaluate `main` to weak head normal form to find out what to do, and we get, approximately,
```
(primitive operation that prints out "Hi!") >> print bar
```
Note that `foo` is no longer mentioned, so it is dead! Now we execute that primitive operation, printing "Hi!", and our "to do list" is reduced to
```
print bar
```
Now we evaluate that to WHNF, and get, roughly,
```
(primitive operation to print "Bye!")
```
Now `bar` is dead. We print "Bye!" and exit.
Consider, now, the first program you described:
```haskell
main = do
  let ls = [0..10000000]
  print ls
```
This desugars to
```haskell
main =
  let ls = [0..10000000]
  in print ls
```
This is where we start. The "root set" at the beginning is everything mentioned in the `in` clause of the expression. So we conceptually have `ls` and `print` to start out. Now we can imagine that `print`, specialized to `[Integer]`, looks vaguely like the following (this is greatly simplified, and will print out the list differently, but that really doesn't matter):
```haskell
print xs = case xs of
  []       -> return ()
  (y : ys) -> printInteger y >> print ys
```
So when we start executing `main` (What do we do now? What will we do afterwards?), we are trying to calculate `print ls`. To do this, we pattern match on the first constructor of `ls`, which forces `ls` to be evaluated to WHNF. We find that the second pattern, `y:ys`, matches, so we replace `print ls` with `printInteger y >> print ys`, where `y` points to the first element of `ls` and `ys` points to the thunk representing the second list constructor of `ls`. But note that `ls` itself is now dead! Nothing is pointing to it! So as `print` forces bits of the list, the bits it has already passed become dead.
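If you want to see the demand-driven half of this in action, here is a small illustration of my own using `Debug.Trace` (it shows when each element is built; it does not show the garbage collection itself):

```haskell
import Debug.Trace (trace)

-- Each trace message fires only when print actually reaches that element,
-- showing that the list is produced and consumed on demand rather than
-- materialized up front.
ls :: [Integer]
ls = map (\i -> trace ("building " ++ show i) i) [0 .. 5]

main :: IO ()
main = print ls
```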
In contrast, when you have
```haskell
main =
  let ls = ...
  in print ls >> print ls
```
and you start executing, you start by calculating the thing to do first (`print ls`). You get
```
(printInteger y >> print ys) >> print ls
```
Everything is the same, except that the second part of the expression now points to `ls`. So even though the first part will be dropping pieces of the list as it goes, the second part will keep holding on to the beginning, keeping it all live.
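If you want to observe the difference yourself, one approach (my suggestion, not part of the answer) is to compare the maximum residency that the RTS reports for the two programs:

```haskell
-- Hypothetical Twice.hs: the version that retains the whole list.
-- Build and run each version with RTS statistics enabled, e.g.
--   ghc -O2 Twice.hs && ./Twice +RTS -s
-- The single-print version should report a small maximum residency;
-- this one should report residency roughly proportional to the list.
main :: IO ()
main = do
  let ls = [0 .. 10000000 :: Integer]
  print ls
  print ls
```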
Let me try explaining with something a little simpler than `IO`. Pretend that your program is an expression of type `[Int]`, and that the job of the runtime system is to print each element on its own line. So we can write
```haskell
countup m n = if m == n then [] else m : countup (m + 1) n
main = countup 0 1000
```
The runtime system holds a value representing everything that it still has to print. Let's call this "current value" `whatPrint`. The RTS follows a process like this:

1. Set `whatPrint` to `main`.
2. Is `whatPrint` empty? If so, I'm done and can exit the program. If not, it is a cons, `printNow : whatPrint'`.
3. Calculate `printNow` and print it.
4. Set `whatPrint` to `whatPrint'` and go back to step 2.

In this model, the "root set" for garbage collection is just `whatPrint`.
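A minimal sketch of that loop as real code (my own rendering of the model above, with `runList` standing in for the runtime system):

```haskell
-- runList only ever holds on to the part of the list it has not yet
-- printed, so everything it has already walked past can be collected.
runList :: [Int] -> IO ()
runList whatPrint =
  case whatPrint of
    []                    -> return ()   -- nothing left: exit
    printNow : whatPrint' -> do
      print printNow                     -- perform one step
      runList whatPrint'                 -- continue with the rest

main :: IO ()
main = runList [0 .. 1000]
```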
In a real program, we don't produce a list; we produce an `IO` action. But such an action is also, conceptually, a lazy data structure. You can think of `>>=`, `return`, and each primitive `IO` operation as a constructor for `IO`. Think of it as
```haskell
data IO :: * -> * where
  Return   :: a -> IO a
  Bind     :: IO a -> (a -> IO b) -> IO b
  PrintInt :: Int -> IO ()
  ReadInt  :: IO Int
  ...
```
The initial value of the current action, call it `whatShouldIDo` (the analogue of `whatPrint` above), is `main`, but its value evolves over time. Only what it points to directly is in the root set. There is no magical analysis necessary.
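To make that concrete, here is a toy interpreter of my own for a structure like the one sketched above (renamed `Action` so it doesn't clash with the real `IO`). The `run` function plays the role of the runtime system, and the only root it holds is the action it is currently looking at:

```haskell
{-# LANGUAGE GADTs #-}

-- A toy stand-in for the IO-as-data-structure picture above.
data Action a where
  Return   :: a -> Action a
  Bind     :: Action a -> (a -> Action b) -> Action b
  PrintInt :: Int -> Action ()

-- The "runtime system": look at the current action, perform one
-- primitive step, and continue with whatever comes next.
run :: Action a -> IO a
run (Return x)   = return x
run (PrintInt n) = print n
run (Bind m k)   = do
  x <- run m       -- once m is finished, nothing references it any more
  run (k x)

main :: IO ()
main = run (Bind (PrintInt 1) (\_ -> PrintInt 2))
```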