I have a Python utility that iterates over a tar.xz file and processes each of the member files. The archive is 15 MB compressed and holds 740 MB of uncompressed data.
On one specific server with very limited memory, the program crashes because it runs out of memory. I used objgraph to see which objects are being created, and it turns out the TarInfo instances are never released. The main loop is similar to this:
with tarfile.open(...) as tar:
    while True:
        next = tar.next()
        if next is None:
            break  # no more members
        stream = tar.extractfile(next)
        process_stream(stream)
        iter += 1
        if not iter % 1000:
            objgraph.show_growth(limit=10)
The output is very consistent:
TarInfo 2040 +1000
TarInfo 3040 +1000
TarInfo 4040 +1000
TarInfo 5040 +1000
TarInfo 6040 +1000
TarInfo 7040 +1000
TarInfo 8040 +1000
TarInfo 9040 +1000
TarInfo 10040 +1000
TarInfo 11040 +1000
TarInfo 12040 +1000
This goes on until all 30,000 files are processed.
Just to make sure, I commented out the lines that create and process the stream. The memory usage pattern remained the same: TarInfo instances are being leaked.
I'm using Python 3.4.1, and this behavior is consistent on Ubuntu, OS X and Windows.
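The growth can be reproduced with the standard library alone, without objgraph, by watching the archive object's members list. A minimal sketch (the in-memory archive and file names are made up for illustration):

```python
import io
import tarfile

# Build a small in-memory archive with placeholder contents.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for i in range(5):
        data = b"example payload"
        info = tarfile.TarInfo(name="file%d.txt" % i)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Iterate the same way the utility does and watch tar.members grow.
with tarfile.open(fileobj=buf, mode="r") as tar:
    count = 0
    while True:
        member = tar.next()
        if member is None:
            break
        count += 1
        # One TarInfo accumulates per member read.
        print(count, len(tar.members))
```

Each call to `next()` adds one entry, so after reading all five members `tar.members` holds five TarInfo objects.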
It looks like this is actually by design. The TarFile object maintains a list of all the TarInfo objects it contains in a members attribute. Each time you call next, the TarInfo object extracted from the archive is appended to that list:
def next(self):
    """Return the next member of the archive as a TarInfo object, when
       TarFile is opened for reading. Return None if there is no more
       available.
    """
    self._check("ra")
    if self.firstmember is not None:
        m = self.firstmember
        self.firstmember = None
        return m

    # Read the next block.
    self.fileobj.seek(self.offset)
    tarinfo = None
    ... <snip>
    if tarinfo is not None:
        self.members.append(tarinfo)  # <-- the TarInfo instance is added to members
The members list will keep growing as you extract more items. It enables the getmembers and getmember methods, but for your use case it is just a nuisance. The best workaround seems to be clearing the members attribute as you iterate (as suggested here):
with tarfile.open(...) as tar:
    while True:
        next = tar.next()
        if next is None:
            break  # no more members
        stream = tar.extractfile(next)
        process_stream(stream)
        iter += 1
        tar.members = []  # Clear the members list so the TarInfo objects can be freed
        if not iter % 1000:
            objgraph.show_growth(limit=10)
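Clearing the list does not interfere with iteration, because next() locates the following member by file offset, and extractfile() works from the TarInfo object alone. A self-contained sketch confirming this (the in-memory archive and its contents are placeholders):

```python
import io
import tarfile

# Build a small in-memory archive with placeholder contents.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for i in range(5):
        data = b"example payload"
        info = tarfile.TarInfo(name="file%d.txt" % i)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

processed = []
with tarfile.open(fileobj=buf, mode="r") as tar:
    while True:
        member = tar.next()
        if member is None:
            break
        stream = tar.extractfile(member)  # needs only the TarInfo, not members
        processed.append(stream.read())
        tar.members = []  # keep the list from accumulating

# Every member was still processed, and no TarInfo instances were retained.
print(len(processed), len(tar.members))
```

Note that this sacrifices getmembers and getmember, which rely on the accumulated list, so it only suits strictly sequential, one-pass processing like yours.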