
Python: garbage collection fails?

Consider the following script:

l = [i for i in range(int(1e8))]
l = []
import gc
gc.collect()
# 0
gc.get_referrers(l)
# [{'__builtins__': <module '__builtin__' (built-in)>, 'l': [], '__package__': None, 'i': 99999999, 'gc': <module 'gc' (built-in)>, '__name__': '__main__', '__doc__': None}]
del l
gc.collect()
# 0

The point is, after all these steps the memory usage of this Python process is around 30% on my machine (Python 2.6.5; more details on request). Here's an excerpt of the output of top:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  
5478 moooeeeep 20   0 2397m 2.3g 3428 S    0 29.8   0:09.15 ipython  

and of ps aux:

moooeeeep 5478  1.0 29.7 2454720 2413516 pts/2 S+   12:39   0:09 /usr/bin/python /usr/bin/ipython gctest.py

According to the docs for gc.collect:

Not all items in some free lists may be freed due to the particular implementation, in particular int and float.

Does this mean that, if I (temporarily) need a large number of different int or float values, I have to offload this part to C/C++ because the Python GC fails to release the memory?


Update

Probably the interpreter is to blame, as this article suggests:

It’s that you’ve created 5 million integers simultaneously alive, and each int object consumes 12 bytes. “For speed”, Python maintains an internal free list for integer objects. Unfortunately, that free list is both immortal and unbounded in size. floats also use an immortal & unbounded free list.

The problem remains, however, as I cannot avoid this amount of data (timestamp/value pairs from an external source). Am I really forced to drop Python and go back to C/C++?
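
For reference, the effect can be observed directly by reading the process's resident set size before and after dropping the list. This is only a rough sketch (Linux-specific, since it parses /proc/self/status, and written against CPython 2.x, where the int free list is immortal):

# Rough sketch (Linux-only): watch the resident set size before and after
# dropping a large list of ints. Under CPython 2.x the freed int objects
# stay on an immortal free list, so RSS does not drop back to the baseline.
import gc

def rss_kb():
    # /proc/self/status reports VmRSS in kB on Linux.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

print('baseline:            %s kB' % rss_kb())
l = [i for i in range(int(1e7))]   # ten million distinct int objects
print('after allocation:    %s kB' % rss_kb())
del l
gc.collect()
print('after del + collect: %s kB' % rss_kb())  # stays far above the baseline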


Update 2

It's probably indeed the case that the Python implementation causes the problem. I found this answer, which conclusively explains the issue and a possible workaround.

asked Mar 08 '12 by moooeeeep


1 Answer

I found this also answered by Alex Martelli in another thread:

Unfortunately (depending on your version and release of Python) some types of objects use "free lists" which are a neat local optimization but may cause memory fragmentation, specifically by making more and more memory "earmarked" for only objects of a certain type and thereby unavailable to the "general fund".

The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.

In your use case, it seems that the best way for the subprocesses to accumulate some results and yet ensure those results are available to the main process is to use semi-temporary files (by semi-temporary I mean, NOT the kind of files that automatically go away when closed, just ordinary files that you explicitly delete when you're all done with them).
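
A minimal sketch of that semi-temporary-file approach might look as follows (the worker function, the result file name and the use of pickle are illustrative assumptions, not part of the original answer):

# Sketch: do the memory-hungry work in a child process and hand the small
# result back through an ordinary file that the parent deletes afterwards.
import multiprocessing
import os
import pickle

def worker(result_path):
    data = [i for i in range(int(1e7))]   # the memory-hungry part
    with open(result_path, 'wb') as f:
        pickle.dump(sum(data), f)         # keep only the small result

if __name__ == '__main__':
    result_path = 'result.tmp'            # "semi-temporary": an ordinary file
    p = multiprocessing.Process(target=worker, args=(result_path,))
    p.start()
    p.join()                              # the child's memory is returned to the OS here
    with open(result_path, 'rb') as f:
        result = pickle.load(f)
    os.remove(result_path)                # ...which we delete explicitly when done
    print(result)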

Fortunately I was able to split the memory-intensive work into separate chunks, which enabled the interpreter to actually free the temporary memory after each iteration. I used the following wrapper to run the memory-intensive function as a subprocess:

import multiprocessing

def run_as_process(func, *args):
    """Run func(*args) in a child process and wait for it to finish."""
    p = multiprocessing.Process(target=func, args=args)
    try:
        p.start()
        p.join()          # block until the child has finished
    finally:
        p.terminate()     # make sure the child is gone, even if join() was interrupted
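
To illustrate how this wrapper might be called (the per-chunk worker and the chunk file names are hypothetical; any results have to be passed back out-of-band, e.g. via files as suggested above, since the child's return value is discarded):

# Hypothetical per-chunk worker: the names and file handling are
# illustrative only, not part of the original question.
def process_chunk(path):
    with open(path) as f:
        pairs = [line.split() for line in f]   # the memory-intensive part
    # ... process pairs and write the (small) results to a semi-temporary file ...

for path in ('chunk0.dat', 'chunk1.dat'):       # hypothetical chunk files
    run_as_process(process_chunk, path)          # RAM is released when each child exits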
answered Nov 11 '22 by moooeeeep