
Managing Memory in Python When Reading Objects of Varying Sizes from an OODB

I'm reading in a collection of objects (tables, like sqlite3 tables or dataframes) from an object-oriented database. Most are small enough that the Python garbage collector handles them without incident. However, when they get larger (though still under 10 MB), the GC doesn't seem to be able to keep up.

Pseudocode looks like this:

import gc

walk = walkgenerator('/path')
objs = objgenerator(walk)
with db.transaction(bundle=True, maxSize=10000, maxParts=10):
    oldobj = None
    oldtable = None
    # enumerate supplies the loop counter used for the periodic collection
    for count, obj in enumerate(objs, 1):
        currenttable = obj.table
        if oldtable and oldtable in currenttable:
            db.delete(oldobj.path)
        del oldtable
        oldtable = currenttable
        del oldobj
        oldobj = obj
        if not count % 100:  # force a collection every 100 objects
            gc.collect()

I'm looking for an elegant way to manage memory while allowing Python to handle it when possible.

Perhaps embarrassingly, I've tried using del to help clean up reference counts.

I've tried gc.collect() at varying modulo counts in my for loops:

  • 100 (no difference),
  • 1 (slows loop quite a lot, and I will still get a memory error of some type),
  • 3 (loop is still slow but memory still blows up eventually)

Suggestions are appreciated!!!

Particularly, if you can give me tools to assist with introspection. I've used Windows Task Manager here, and it seems to more or less randomly spring a memory leak. I've limited the transaction size as much as I feel comfortable, and that seems to help a little bit.

asked Nov 07 '13 06:11 by Russia Must Remove Putin





1 Answer

There's not enough info here to say much, but what I do have to say wouldn't fit in a comment so I'll post it here ;-)

First, and most importantly, in CPython garbage collection is mostly based on reference counting. gc.collect() won't do anything for you (except burn time) unless trash objects are involved in reference cycles (an object A can be reached from itself by following a chain of pointers transitively reachable from A). You create no reference cycles in the code you showed, but perhaps the database layer does.

So, after you run gc.collect(), does memory use go down at all? If not, running it is pointless.
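One rough way to check, as a sketch: gc.collect() returns the number of unreachable objects it found, so a return value that stays at zero suggests there are no cycles for it to reclaim. The Node class and make_cycle() helper below are purely illustrative and not part of the code in the question.

```python
import gc

class Node:
    """Hypothetical stand-in for an object that can take part in a cycle."""
    def __init__(self):
        self.other = None

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a      # a -> b -> a: a reference cycle
    # On return, a and b are unreachable, but their refcounts never hit zero.

gc.disable()                     # keep automatic collection out of the demo
gc.collect()                     # start from a clean slate
make_cycle()
found_with_cycle = gc.collect()  # the cyclic trash is found here
found_without = gc.collect()     # nothing new to reclaim this time
gc.enable()

print("unreachable objects found:", found_with_cycle, "then", found_without)
```

In the loop from the question, printing the return value of each gc.collect() call would show whether the collector is finding anything at all.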

I expect it's most likely that the database layer is holding references to objects longer than necessary, but digging into that requires digging into exact details of how the database layer is implemented.

One way to get clues is to print the result of sys.getrefcount() applied to various large objects:

>>> import sys
>>> bigobj = [1] * 1000000
>>> sys.getrefcount(bigobj)
2

As the docs say, the result is generally 1 larger than you might hope, because the refcount of getrefcount()'s argument is temporarily incremented by 1 simply because it is being used (temporarily) as an argument.

So if you see a refcount greater than 2, del won't free the object.
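To see why, here is a small sketch in which a hypothetical `cache` list stands in for whatever the database layer might be holding on to. The Table class is also made up for illustration, and the immediate freeing at the end relies on CPython's reference counting.

```python
import sys
import weakref

class Table:
    """Hypothetical stand-in for a large object read from the database."""

cache = []                  # pretend this lives inside the database layer
obj = Table()
cache.append(obj)           # a second reference the caller may not know about

count_before = sys.getrefcount(obj)   # inflated by the extra reference
alive = weakref.ref(obj)    # lets us check later whether the object still exists

del obj                     # removes only the name `obj`
still_alive = alive() is not None     # the cache entry keeps the object alive

cache.clear()               # drop the last remaining reference
freed = alive() is None     # in CPython the object is freed right away

print(count_before, still_alive, freed)
```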

Another way to get clues is to pass the object to gc.get_referrers(). That returns a list of objects that directly refer to the argument (provided that a referrer participates in Python's cyclic gc).
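A minimal sketch of that idea, assuming a made-up `holder` dict that plays the role of some hidden container in the database layer:

```python
import gc

class Table:
    """Hypothetical stand-in for a large object."""

holder = {"kept": None}     # hypothetical container hiding a reference
obj = Table()
holder["kept"] = obj

referrers = gc.get_referrers(obj)

# The namespace that defines the name `obj` typically shows up as a referrer
# too, so look specifically for the container we suspect.
found_holder = any(r is holder for r in referrers)

for r in referrers:
    print(type(r).__name__)
print("holder refers to obj:", found_holder)
```

With a real database object, the types printed for the referrers are often the quickest hint about which layer is keeping it alive.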

BTW, you need to be clearer about what you mean by "doesn't seem to work" and "blows up eventually". Can't guess. What exactly goes wrong? For example, is MemoryError raised? Something else? Tracebacks often yield a world of useful clues.

answered Sep 28 '22 21:09 by Tim Peters
