Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Behavior of the garbage collector

I have a Django application that exhibits some strange garbage collection behavior. There is one view in particular that will just keep growing the VM size significantly every time it is called - up to a certain limit, at which point usage drops back again. The problem is that it's taking considerable time until that point is reached, and in fact the virtual machine running my app doesn't have enough memory for all FCGI processes to take as much memory as they then sometimes do.

I've spent the last two days investigating this and learning about Python garbage collection, and I think I do understand what is happening now - for the most part. When using

gc.set_debug(gc.DEBUG_STATS)

Then for a single request, I see the following output:

>>> c = django.test.Client()
>>> c.get('/the/view/')
gc: collecting generation 0...
gc: objects in each generation: 724 5748 147341
gc: done.
gc: collecting generation 0...
gc: objects in each generation: 731 6460 147341
gc: done.
[...more of the same...]    
gc: collecting generation 1...
gc: objects in each generation: 718 8577 147341
gc: done.
gc: collecting generation 0...
gc: objects in each generation: 714 0 156614
gc: done.
[...more of the same...]
gc: collecting generation 0...
gc: objects in each generation: 715 5578 156612
gc: done.

So essentially, a huge amount of objects are allocated, but are initially moved to generation 1, and when gen 1 is sweeped during the same request, they are moved to generation 2. If I do a manual gc.collect(2) afterwards, they are removed. And, as I mentioned, there also removed when the next automatic gen 2 sweep happens, which, if I understand correctly, would in this case something like every 10 requests (at this point the app needs about a 150MB).

Alright, so initially I thought that there might be some cyclic referencing going on within the processing of one request that prevents any of these objects from being collected within the handling of that request. However, I've spent hours trying to find one using pympler.muppy and objgraph, both after and by debugging inside the request processing, and there don't seem to be any. Rather, it seems the 14.000 objects or so that are created during the request are all within a reference chain to some request-global object, i.e. once the request goes away, they can be freed.

That has been my attempt at explaining it, anyway. However, if that's true and there are indeed no cycling dependencies, shouldn't the whole tree of objects be freed once whatever request object that causes them to be held goes away, without the garbage collector being involved, purely by virtue of the reference counts dropping to zero?

With that setup, here are my questions:

  • Does the above even make sense, or do I have to look for the problem elsewhere? Is it just an unfortunate accident that significant data is kept around for so long in this particular use case?

  • Is there anything I can do to avoid the issue. I already see some potential to optimize the view, but that appears to be a solution with limited scope - although I am not sure what I generic one would be, either; how advisable is it for example to call gc.collect() or gc.set_threshold() manually?

In terms of how the garbage collector itself works:

  • Do I understand correctly that an object is always moved to the next generation if a sweep looks at it and determines that the references it has are not cyclic, but can in fact be traced to a root object.

  • What happens if the gc does a, say, generation 1 sweep, and finds an object that is referenced by an object within generation 2; does it follow that relationship inside generation 2, or does it wait for a generation 2 sweep to occur before analyzing the situation?

  • When using gc.DEBUG_STATS, I care primarily about the "objects in each generation" info; however, I keep getting hundreds of "gc: 0.0740s elapsed.", "gc: 1258233035.9370s elapsed." messages; they are totally inconvenient - it takes considerable time for them to be printed out, and they make the interesting things a lot harder to find. Is there a way to get rid of them?

  • I don't suppose there is a way to do a gc.get_objects() by generation, i.e. only retrieve the objects from generation 2, for example?

like image 457
miracle2k Avatar asked Nov 16 '09 06:11

miracle2k


1 Answers

Does the above even make sense, or do I have to look for the problem elsewhere? Is it just an unfortunate accident that significant data is kept around for so long in this particular use case?

Yes, it does make sense. And yes, there are other issues worth to consider. Django uses threading.local as base for DatabaseWrapper (and some contribs use it to make request object accessible from places where it's not passed explicitly). These global objects survive requests and can keep references to objects till some other view is handled in the thread.

Is there anything I can do to avoid the issue. I already see some potential to optimize the view, but that appears to be a solution with limited scope - although I am not sure what I generic one would be, either; how advisable is it for example to call gc.collect() or gc.set_threshold() manually?

General advice (probably you know it, but anyway): avoid circular references and globals (including threading.local). Try to break cycles and clear globals when django design makes hard to avoid them. gc.get_referrers(obj) might help you to find places requiring attention. Another way it to disable garbage collector and call it manually after each request, when it's the best place to do (this will prevent objects from moving to the next generation).

I don't suppose there is a way to do a gc.get_objects() by generation, i.e. only retrieve the objects from generation 2, for example?

Unfortunately this is not possible with gc interface. But there are several ways to go. You can consider the end of list returned by gc.get_objects() only, since objects in this list are sorted by generation. You can compare the list with one returned from previous call by storing weak references to them (e.g. in WeakKeyDictionary) between calls. You can rewrite gc.get_objects() in your own C module (it's easy, mostly copy-paste programming!) since they are stored by generation internally, or even access internal structures with ctypes (requires quite deep ctypes understanding).

like image 147
Denis Otkidach Avatar answered Oct 21 '22 08:10

Denis Otkidach