
Tarfile in Python: Can I untar more efficiently by extracting only some of the data?

I am ordering a huge pile of Landsat scenes from the USGS, which come as tar.gz archives. I am writing a simple Python script to unpack them. Each archive contains 15 TIFF images of 60-120 MB each, totalling just over 2 GB. I can easily extract an entire archive with the following code:

import tarfile
fileName = "LT50250232011160-SC20140922132408.tar.gz"
tfile = tarfile.open(fileName, 'r:gz')
tfile.extractall("newfolder/")

I only actually need 6 of those 15 tiffs, identified as "bands" in the title. These are some of the larger files, so together they account for about half the data. So, I thought I could speed this process up by modifying the code as follows:

fileName = "LT50250232011160-SC20140922132408.tar.gz"
tfile = tarfile.open(fileName, 'r:gz')
membersList = tfile.getmembers()
namesList = tfile.getnames()
bandsList = [x for x, y in zip(membersList, namesList) if "band" in y]
print("extracting...")
tfile.extractall("newfolder/",members=bandsList)

However, adding a timer to both scripts reveals no significant efficiency gain from the second script (on my system, both take about a minute for a single scene). While the extraction itself is somewhat faster, that gain seems to be offset by the time it takes to figure out which files need to be extracted in the first place.

The question is: is this tradeoff inherent in what I am doing, or just the result of my code being inefficient? I'm relatively new to Python and only discovered tarfile today, so it wouldn't surprise me if the latter were true, but I haven't been able to find any recommendations for efficient extraction of only part of an archive.

Thanks!

Joe asked Sep 26 '14 20:09



1 Answer

You can do that more efficiently by opening the tar file as a stream (see https://docs.python.org/2/library/tarfile.html#tarfile.open).

mkdir tartest
cd tartest/
dd if=/dev/urandom of=file1 count=100 bs=1M
dd if=/dev/urandom of=file2 count=100 bs=1M
dd if=/dev/urandom of=file3 count=100 bs=1M
dd if=/dev/urandom of=file4 count=100 bs=1M
dd if=/dev/urandom of=file5 count=100 bs=1M
cd ..
tar czvf test.tgz tartest

Now read like this:

import tarfile
fileName = "test.tgz"
tfile = tarfile.open(fileName, 'r|gz')
for t in tfile:
    if "file3" in t.name: 
        f = tfile.extractfile(t)
        if f:
            print(len(f.read()))

Note the | in the mode passed to tarfile.open. We only read file3.

$ time python test.py

104857600

real    0m1.201s
user    0m0.820s
sys     0m0.377s

If I change the r|gz back to the r:gz I get:

$ time python test.py 
104857600

real    0m7.033s
user    0m6.293s
sys     0m0.730s

That is roughly 5 to 6 times faster (we have 5 equally sized files and read only one of them). The reason is that the standard way of opening allows seeking backwards, and in a compressed tar file it can only do so by decompressing (a gzip stream has no index, so reaching a given offset means decompressing the data before it). If you open the archive as a stream, you can no longer seek randomly, but reading sequentially, which is possible in your case, is much faster. You also cannot call getnames() beforehand anymore, but that is not necessary here.
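Applied to the Landsat archives from the question, the same idea might look like the sketch below. It assumes Python 3. The function name extract_matching and its parameters are illustrative and not part of tarfile. It also assumes that writing each matching file directly into the destination folder under its base name is acceptable, which discards any directory structure stored in the archive.

```python
import os
import shutil
import tarfile


def extract_matching(archive_path, dest_dir, keyword):
    """Copy out regular files whose name contains `keyword`, in one forward pass.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    # 'r|gz' opens the archive as a forward-only stream, so each member
    # has to be handled while the loop is positioned on it.
    with tarfile.open(archive_path, "r|gz") as tfile:
        for member in tfile:
            if not member.isfile() or keyword not in member.name:
                continue  # skipped members are never written to disk
            src = tfile.extractfile(member)
            # Assumption: flatten to the base name inside dest_dir.
            target = os.path.join(dest_dir, os.path.basename(member.name))
            with open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted
```

For the archive in the question, this would be called as extract_matching("LT50250232011160-SC20140922132408.tar.gz", "newfolder", "band"). Whether it beats extractall on a given system is something to measure rather than assume, since the whole archive still has to be decompressed once.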

cronos answered Sep 23 '22 19:09
