I've been experimenting with programming language design, and have come to the point of needing to implement a garbage collection system. Now the first thing that came to mind was reference counting, but this won't handle reference loops. Most of the pages that I come across when searching for algorithms are references on tuning the garbage collectors in existing languages, such as Java. When I do find anything describing specific algorithms, I'm not getting enough detail for implementation. For example, most of the descriptions include "when your program runs low on memory...", which isn't likely to happen anytime soon on a 4 GB system with plenty of swap. So what I'm looking for is some tutorials with good implementation details, such as how to tune when to kick off the garbage collector (e.g., collect after X number of memory allocations, or every Y minutes, etc).
To give a couple more details on what I'm trying to do, I'm starting off with writing a stack-based interpreter similar to Postscript, and my next attempt will be probably an S-expression language based on one of the Lisp dialects. I am implementing in straight C. My goal is both self education, and to document the various stages into a "how to design and write an interpreter" tutorial.
As for what I've done so far, I've written a simple interpreter which implements a C style imperative language, which gets parsed and processed by a stack machine style VM (see lang2e.sourceforge.net). But this language doesn't allocate new memory on entering any function, and doesn't have any pointer data types so there wasn't really a need at the time for any type of advanced memory management. For my next project I'm thinking of starting off with reference counting for non-pointer type objects (integers, strings, etc), and then keeping track of any pointer-type object (which can generate circular references) in a separate memory pool. Then, whenever the pool grows more than X allocation units more than it was at the end of the previous garbage collection cycle, kick off the collector again.
My requirement is that it not be too inefficient, yet be easy to implement and document clearly (remember, I want to develop this into a paper or book for others to follow). The algorithm I've currently got at the front is tri-color marking. It looks like a generational collector would be a bit better, but harder to document and understand. So I'm looking for some clear reference material (preferably available online) that includes enough implementation details to get me started.
There's a great book about garbage collection. It's called Garbage Collection: Algorithms for Automatic Dynamic Memory Management, and it's excellent. I've read it, so I'm not recommending this just because you can find it with Google. Look at it here.
For simple prototyping, use mark-and-sweep or any simple non-generational, non-incremental compacting collector. Incremental collectors are good only if you need to provide for "real-time" response from your system. As long as your system is allowed to lag arbitrarily much at any particular point in time, you don't need an incremental one. Generational collectors reduce average garbage collection overhead at the expense of assuming something about the life cycles of objects (typically, that most objects die young).
I have implemented all of them (generational/non-generational, incremental/non-incremental), and debugging garbage collectors is quite hard. Because you want to focus on the design of your language, and maybe not so much on debugging a more complex garbage collector, you could stick to a simple one. I would most likely go for mark-and-sweep.
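To give a feel for how little code a basic mark-and-sweep needs, here is a minimal sketch in C. Everything in it is an assumption made for illustration: the Obj layout (two outgoing references, like a cons cell), the Heap struct, the fixed-size roots array standing in for the interpreter's stack, and the names gc_alloc, push_root and gc_collect. It is not taken from any particular implementation, and the recursive mark would need an explicit worklist in a real interpreter to avoid overflowing the C stack on deep structures.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_ROOTS 16

/* Hypothetical object layout: each object carries a mark bit and is
   linked into a list of every allocation so the sweep can visit it. */
typedef struct Obj {
    struct Obj *next;    /* intrusive list of all allocated objects */
    struct Obj *ref[2];  /* outgoing references (cons-like cell)    */
    int marked;
} Obj;

typedef struct {
    Obj *all;                /* head of the all-objects list        */
    Obj *roots[MAX_ROOTS];   /* stand-in for the VM's value stack   */
    int nroots;
    size_t live;             /* objects currently allocated         */
} Heap;

static Obj *gc_alloc(Heap *h) {
    Obj *o = calloc(1, sizeof *o);
    if (o == NULL) abort();
    o->next = h->all;        /* remember the object for the sweep   */
    h->all = o;
    h->live++;
    return o;
}

static void push_root(Heap *h, Obj *o) {
    assert(h->nroots < MAX_ROOTS);
    h->roots[h->nroots++] = o;
}

/* Mark phase: flag everything reachable from o. The mark bit also
   stops the recursion when it meets a cycle. */
static void mark(Obj *o) {
    if (o == NULL || o->marked) return;
    o->marked = 1;
    mark(o->ref[0]);
    mark(o->ref[1]);
}

/* Mark from the roots, then sweep the all-objects list, freeing
   anything left unmarked. Returns the number of objects freed. */
static size_t gc_collect(Heap *h) {
    size_t freed = 0;
    for (int i = 0; i < h->nroots; i++) mark(h->roots[i]);

    Obj **p = &h->all;
    while (*p != NULL) {
        Obj *o = *p;
        if (o->marked) {
            o->marked = 0;   /* reset for the next cycle */
            p = &o->next;
        } else {
            *p = o->next;    /* unlink and release       */
            free(o);
            h->live--;
            freed++;
        }
    }
    return freed;
}
```

Note that two objects pointing at each other with no root referring to them are reclaimed by the sweep, which is exactly the case plain reference counting cannot handle.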
When you use garbage collection, you do not need reference counting. Throw it away.
When to kick off the collector is probably wide open -- you could GC when a memory allocation would otherwise fail, or you could GC every time a reference is dropped, or anywhere in the middle.
Waiting until you've got no choice may mean you never GC, if the running code is fairly well contained. Or, it may introduce a gigantic pause into your environment and demolish your response time or animations or sound playback completely.
Running the full GC on every free() could amortize the cost across more operations, though the entire system may run slower as a result. You could be more predictable, but slower overall.
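One middle ground between those two extremes is the policy suggested in the question: collect once the heap has grown by X objects since the end of the previous collection. The sketch below shows only that bookkeeping. The GcPolicy struct and the gc_note_alloc and gc_note_collected names are hypothetical, and counting objects rather than bytes is an assumption made to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a growth-based GC trigger. */
typedef struct {
    size_t live;         /* objects currently allocated              */
    size_t live_at_gc;   /* live count at the end of the last GC     */
    size_t growth;       /* the "X": allowed growth between GCs      */
    size_t collections;  /* number of collections run so far         */
} GcPolicy;

/* Call on every allocation. Returns 1 when the heap has grown by at
   least `growth` objects since the last collection, 0 otherwise. */
static int gc_note_alloc(GcPolicy *p) {
    p->live++;
    return p->live > p->live_at_gc &&
           p->live - p->live_at_gc >= p->growth;
}

/* Call after a collection with the number of objects it freed, so the
   next trigger is measured from the surviving heap size. */
static void gc_note_collected(GcPolicy *p, size_t freed) {
    assert(freed <= p->live);
    p->live -= freed;
    p->live_at_gc = p->live;
    p->collections++;
}
```

The allocator would call gc_note_alloc on each allocation, run the collector when it returns 1, and then report the result through gc_note_collected. A small growth value behaves more like collecting on every free(), and a very large one approaches waiting until allocation fails.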
If you'd like to test the thing by artificially limiting memory, you can simply run with very restrictive resource limits in place. Run ulimit -v 1024 and every process spawned by that shell will only ever have one megabyte of memory to work with.