Download, extract and read a gzip file in Python

Tags: python

I'd like to download, extract and iterate over a text file in Python without having to create temporary files.

Basically, I want the equivalent of this pipe, but in Python:

curl ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz | gunzip | processing step

Here's my code:

def main():
    import urllib
    import gzip

    # Download SEED database
    print 'Downloading SEED Database'
    handle = urllib.urlopen('ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz')


    with open('SEED.fasta.gz', 'wb') as out:
        while True:
            data = handle.read(1024)
            if len(data) == 0: break
            out.write(data)

    # Extract SEED database
    handle = gzip.open('SEED.fasta.gz')
    with open('SEED.fasta', 'w') as out:
        for line in handle:
            out.write(line)

    # Filter SEED database
    pass

I don't want to use subprocess.Popen() or anything similar because I want this script to be platform-independent.

The problem is that the Gzip library only accepts filenames as arguments and not handles. The reason for "piping" is that the download step only uses up ~5% CPU and it would be faster to run the extraction and processing at the same time.

EDIT: Passing the download handle straight to gzip won't work because

"Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream." - Dive Into Python

Which is why I get the error

AttributeError: addinfourl instance has no attribute 'tell'

So how does curl url | gunzip | whatever work?

asked Aug 23 '10 14:08

Austin Richardson

2 Answers

Just use gzip.GzipFile(fileobj=handle) and you'll be on your way. In other words, it's not really true that "the Gzip library only accepts filenames as arguments and not handles"; you just have to use the fileobj= named argument.
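
As a rough illustration of the fileobj= approach, here is a minimal sketch. It assumes Python 3, where (as far as I know) GzipFile only needs to read forward from the object it is given. ForwardOnlyStream and iter_gzip_lines are hypothetical names made up for this example, and an in-memory sample replaces the real download so the snippet runs offline.

```python
import gzip
import io


class ForwardOnlyStream:
    """Hypothetical stand-in for a network response: it only offers read()."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def read(self, size=-1):
        return self._buffer.read(size)


def iter_gzip_lines(handle, encoding="ascii"):
    """Yield decoded lines from a gzip-compressed, file-like object."""
    # fileobj= lets GzipFile wrap an already-open handle instead of a filename.
    with gzip.GzipFile(fileobj=handle, mode="rb") as gz:
        for raw_line in gz:
            yield raw_line.decode(encoding).rstrip("\n")


# In-memory sample instead of a real download, so the sketch runs offline.
sample = gzip.compress(b">seq1\nACGT\n>seq2\nTTGA\n")
lines = list(iter_gzip_lines(ForwardOnlyStream(sample)))
```

With a real download, the object returned by urllib.request.urlopen(url) would take the place of ForwardOnlyStream, and each line could be processed as soon as it is decompressed.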

answered Sep 18 '22 22:09

Alex Martelli

I found this question while searching for ways to download and decompress a gzip file from a URL, but I couldn't get the accepted answer to work in Python 2.7.

Here's what worked for me (adapted from here):

import urllib2
import gzip
import StringIO

def download(url):
    # Download SEED database
    # Output name: last URL component with the ".gz" suffix stripped
    out_file_path = url.split("/")[-1][:-3]
    print('Downloading SEED Database from: {}'.format(url))
    response = urllib2.urlopen(url)
    # Buffer the whole compressed payload in memory so GzipFile gets a seekable object
    compressed_file = StringIO.StringIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)

    # Extract SEED database (binary mode avoids newline translation on Windows)
    with open(out_file_path, 'wb') as outfile:
        outfile.write(decompressed_file.read())

    # Filter SEED database
    # ...
    return

if __name__ == "__main__":
    download("ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.0/fasta_files/RF00001.fa.gz")

I changed the target URL since the original one was dead: I just looked for a gzip file served from an FTP server, as in the original question.
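
Note that response.read() in the code above pulls the entire compressed file into memory before decompression starts. If that matters, one alternative is to feed fixed-size chunks to zlib.decompressobj, which is a different technique from the StringIO buffer used above. The sketch below assumes Python 3 and a single-member gzip stream; iter_decompressed_chunks is a hypothetical helper name, and an in-memory buffer stands in for the urlopen() response so the snippet runs offline.

```python
import gzip
import io
import zlib


def iter_decompressed_chunks(handle, chunk_size=1024):
    """Yield decompressed bytes while reading the stream one chunk at a time."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit whatever is still buffered inside the decompressor.
    tail = decompressor.flush()
    if tail:
        yield tail


# Offline demo: an in-memory buffer plays the role of the urlopen() response.
payload = b">seq1\nACGT\n" * 500
fake_response = io.BytesIO(gzip.compress(payload))
restored = b"".join(iter_decompressed_chunks(fake_response, chunk_size=16))
```

Because only forward read() calls are made on the handle, this mirrors what curl url | gunzip does: each chunk is decompressed as it arrives, and nothing ever needs to seek backwards.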

answered Sep 22 '22 22:09

gibbone
