Having Trouble Getting requests==2.7.0 to Automatically Decompress gzip

I am trying to read a gzipped XML file that I fetch with requests. Everything I have read indicates that the decompression should happen automatically.

#!/usr/bin/python

from __future__ import unicode_literals
import requests

if __name__ == '__main__':

    url = 'http://rdf.dmoz.org/rdf/content.rdf.u8.gz'

    headers = {
        'Accept-Encoding': "gzip,x-gzip,deflate,sdch,compress",
        'Accept-Content': 'gzip',
        'HTTP-Connection': 'keep-alive',
        'Accept-Language': "en-US,en;q=0.8",
    }

    request_reply = requests.get(url, headers=headers)

    print request_reply.headers

    request_reply.encoding = 'utf-8'
    print request_reply.text[:200]
    print request_reply.content[:200]

The header in my first line of output looks like this:

{'content-length': '260071268', 'accept-ranges': 'bytes', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Tue, 08 Sep 2015 16:27:49 GMT', 'content-type': 'application/x-gzip'}

The next two output lines appear to be binary, where I was expecting XML text:

�Iɒ(�����~ؗool���u�rʹ�J���io�   a2R1��ߞ|�<����_��������Ҽҿ=�Z����onnz7�{JO���}h�����6��·��>,aҚ>��hZ6�u��x���?y�_�.y�$�Բ
�Iɒ(�����~ؗool���u�rʹ�J���io�   a2R1��ߞ|�<����_��������Ҽҿ=�Z����onnz7�{JO��}h�����6��·��>,aҚ>��hZ6�u��x���

I think part of the problem is that site-packages/requests/packages/urllib3/response.py does not recognize gzip unless the header has 'content-encoding': 'gzip'

I was able to get the results I wanted by adding 4 lines to a method in response.py like so:

    def _init_decoder(self):
        """
        Set-up the _decoder attribute if necessary.
        """
        # Note: content-encoding value should be case-insensitive, per RFC 7230
        # Section 3.2
        content_encoding = self.headers.get('content-encoding', '').lower()
        if self._decoder is None and content_encoding in self.CONTENT_DECODERS:
            self._decoder = _get_decoder(content_encoding)

        # My added code below this comment
            return
        content_type = self.headers.get('content-type', '').lower()
        if self._decoder is None and content_type == 'application/x-gzip':
            self._decoder = _get_decoder('gzip')

But, is there a better way?

user2367072 asked Sep 08 '15 17:09


1 Answer

You misunderstood. Only transport-level compression is handled automatically, that is, compression applied by the HTTP server for the transfer itself.

Your content is itself a compressed file. Because that compression was not applied just for the HTTP transport stage, requests won't remove it.

requests communicates to the server that it accepts compressed responses by sending Accept-Encoding: gzip, deflate with every request sent. The server can then respond by compressing the whole response body and adding a Content-Encoding header indicating the compression used.

Your response has no Content-Encoding header, nor would applying compression again make sense here.
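To make the distinction concrete, here is a minimal sketch of how one might tell the two cases apart from the response headers and the first bytes of the body. The helper name `classify_gzip` is hypothetical and not part of requests, and the plain dict of lower-cased header names is an assumption for illustration (requests itself exposes a case-insensitive mapping).

```python
GZIP_MAGIC = b'\x1f\x8b'  # the first two bytes of every gzip stream


def classify_gzip(headers, body_start):
    """Return 'transport', 'payload', or 'none' for a response.

    headers:    mapping of lower-cased header names to values (assumption).
    body_start: the first bytes of the body as the application received them.
    """
    encoding = headers.get('content-encoding', '').lower()
    if encoding in ('gzip', 'x-gzip'):
        # Compression applied for transport; requests undoes this for you.
        return 'transport'
    if body_start[:2] == GZIP_MAGIC:
        # The resource itself is a gzip file; decompressing it is up to you.
        return 'payload'
    return 'none'


# Headers as in the question: no Content-Encoding, only a gzip content type.
question_headers = {'content-type': 'application/x-gzip'}
print(classify_gzip(question_headers, b'\x1f\x8b\x08\x00'))  # prints "payload"
```

The DMOZ response from the question falls into the 'payload' case, which is why the body still looks binary after requests has finished with it.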

Most of the time you want to download an already compressed archive like the DMOZ RDF dataset in the compressed form, anyway. You requested a compressed archive after all. It is not the job of the requests library to decode that.

In Python 3 you can handle decoding as a stream by using the gzip module and streaming the response:

import gzip
import requests
import shutil

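url = 'http://rdf.dmoz.org/rdf/content.rdf.u8.gz'  # URL from the question
path = 'content.rdf.u8'  # hypothetical output path for the decompressed data

r = requests.get(url, stream=True)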
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True  # just in case transport encoding was applied
        gzip_file = gzip.GzipFile(fileobj=r.raw)
        shutil.copyfileobj(gzip_file, f)

where you could use an RDF parser instead of copying the decompressed data to disk, of course.
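As a rough illustration of parsing instead of copying, the sketch below feeds a gzip stream into the standard library's incremental XML parser. It is not a real RDF parser; the function name `iter_element_text` and the sample document are made up for the example. With requests on Python 3 you would pass `r.raw` from a `stream=True` request where the example passes an in-memory buffer.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_element_text(fileobj, tag):
    """Yield the text of each `tag` element in a gzip-compressed XML stream."""
    with gzip.GzipFile(fileobj=fileobj) as xml_stream:
        # iterparse reads incrementally, so the whole file is never in memory.
        for _event, elem in ET.iterparse(xml_stream, events=('end',)):
            if elem.tag == tag:
                text = elem.text
                elem.clear()  # drop the element's content once it is handled
                yield text


# Made-up sample data standing in for the downloaded archive.
sample = gzip.compress(b'<root><item>first</item><item>second</item></root>')
print(list(iter_element_text(io.BytesIO(sample), 'item')))  # prints ['first', 'second']
```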

Unfortunately, the Python 2 implementation of the gzip module requires a seekable file; you can either create your own streaming wrapper or add that _decoder attribute to the r.raw object above.
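One possible shape for such a streaming wrapper is sketched below. Note that it swaps in a different technique: it uses `zlib.decompressobj` directly rather than the gzip module or the `_decoder` attribute, so it never needs to seek. The name `iter_gunzip` is hypothetical, and the sketch assumes a single-member gzip file; a file made of several concatenated gzip members would need extra handling.

```python
import gzip
import io
import zlib


def iter_gunzip(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Build a small gzip payload in memory and feed it in 8-byte pieces,
# the way a streamed download would arrive in chunks.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'<RDF>example</RDF>')
compressed = buf.getvalue()
pieces = [compressed[i:i + 8] for i in range(0, len(compressed), 8)]
print(b''.join(iter_gunzip(pieces)) == b'<RDF>example</RDF>')  # prints True
```

With a `stream=True` request, the chunks could come from `r.iter_content(chunk_size=...)`, writing each decompressed block to the output file as it is produced.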

Martijn Pieters answered Oct 23 '22 16:10