
Leaking TarInfo objects

I have a Python utility that goes over a tar.xz file and processes each of the individual files. This is a 15MB compressed file, with 740MB of uncompressed data.

On one specific server with very limited memory, the program crashes because it runs out of memory. I used objgraph to see which objects are created. It turns out that the TarInfo instances are not being released. The main loop is similar to this:

with tarfile.open(...) as tar:
    iter = 0
    while True:
        member = tar.next()
        if member is None:
            break
        stream = tar.extractfile(member)
        process_stream(stream)
        iter += 1
        if not iter % 1000:
            objgraph.show_growth(limit=10)

The output is very consistent:

TarInfo     2040     +1000
TarInfo     3040     +1000
TarInfo     4040     +1000
TarInfo     5040     +1000
TarInfo     6040     +1000
TarInfo     7040     +1000
TarInfo     8040     +1000
TarInfo     9040     +1000
TarInfo    10040     +1000
TarInfo    11040     +1000
TarInfo    12040     +1000

This goes on until all 30,000 files are processed.

Just to make sure, I've commented out the lines creating the stream and processing it. The memory usage remained the same: the TarInfo instances are leaked either way.

I'm using Python 3.4.1, and this behavior is consistent on Ubuntu, OS X and Windows.

zmbq asked Oct 15 '14 at 15:10


1 Answer

It looks like this is actually by design. The TarFile object maintains a list of all the TarInfo objects it contains in a members attribute. Each time you call next, the TarInfo object it extracts from the archive is added to the list:

def next(self):
    """Return the next member of the archive as a TarInfo object, when
       TarFile is opened for reading. Return None if there is no more
       available.
    """
    self._check("ra")
    if self.firstmember is not None:
        m = self.firstmember
        self.firstmember = None
        return m

    # Read the next block.
    self.fileobj.seek(self.offset)
    tarinfo = None
    ... <snip>

    if tarinfo is not None:
        self.members.append(tarinfo)  # <-- the TarInfo instance is added to members
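
That append is easy to observe directly, without objgraph: len(tar.members) grows by one per header read. A minimal sketch (not from the original answer; it uses a small in-memory archive with made-up file names as a stand-in for the real tar.xz):

```python
import io
import tarfile

# Build a small in-memory tar archive with five dummy members.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tw:
    for i in range(5):
        data = b"payload %d" % i
        info = tarfile.TarInfo(name="file%d.txt" % i)
        info.size = len(data)
        tw.addfile(info, io.BytesIO(data))

# Re-open it for reading and watch members accumulate as next()
# walks the archive.
buf.seek(0)
counts = []
with tarfile.open(fileobj=buf, mode="r") as tar:
    while True:
        member = tar.next()
        if member is None:
            break
        counts.append(len(tar.members))

print(counts)  # grows by one per member: [1, 2, 3, 4, 5]
```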

The members list will just keep growing as you extract more items. It is what enables the getmembers and getmember methods, but for your use case it is just a nuisance. The best workaround seems to be to keep clearing the members attribute as you iterate (as suggested here):

with tarfile.open(...) as tar:
    iter = 0
    while True:
        member = tar.next()
        if member is None:
            break
        stream = tar.extractfile(member)
        process_stream(stream)
        iter += 1
        tar.members = []  # Clear members list so TarInfo objects can be freed
        if not iter % 1000:
            objgraph.show_growth(limit=10)
dano answered Sep 30 '22 at 19:09