Having Trouble Getting requests==2.7.0 to Automatically Decompress gzip

Tags:

python

python-requests

I am trying to read a gzipped XML file that I request via requests. Everything that I have read indicates that the uncompressing should happen automatically.

#!/usr/bin/python

from __future__ import unicode_literals
import requests

if __name__ == '__main__':

    url = 'http://rdf.dmoz.org/rdf/content.rdf.u8.gz'

    headers = {
        'Accept-Encoding': "gzip,x-gzip,deflate,sdch,compress",
        'Accept-Content': 'gzip',
        'HTTP-Connection': 'keep-alive',
        'Accept-Language': "en-US,en;q=0.8",
    }

    request_reply = requests.get(url, headers=headers)

    print request_reply.headers

    request_reply.encoding = 'utf-8'
    print request_reply.text[:200]
    print request_reply.content[:200]

The header in my first line of output looks like this:

{'content-length': '260071268', 'accept-ranges': 'bytes', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Tue, 08 Sep 2015 16:27:49 GMT', 'content-type': 'application/x-gzip'}

The next two output lines appear to be binary, where I was expecting XML text:

�Iɒ(�����~ؗool���u�rʹ�J���io�   a2R1��ߞ|�<����_��������Ҽҿ=�Z����onnz7�{JO���}h�����6��·��>,aҚ>��hZ6�u��x���?y�_�.y�$�Բ
�Iɒ(�����~ؗool���u�rʹ�J���io�   a2R1��ߞ|�<����_��������Ҽҿ=�Z����onnz7�{JO��}h�����6��·��>,aҚ>��hZ6�u��x���

I think part of the problem is that site-packages/requests/packages/urllib3/response.py does not recognize gzip unless the response headers include 'content-encoding': 'gzip'.

I was able to get the results I wanted by adding 4 lines to a method in response.py like so:

    def _init_decoder(self):
        """
        Set-up the _decoder attribute if necessary.
        """
        # Note: content-encoding value should be case-insensitive, per RFC 7230
        # Section 3.2
        content_encoding = self.headers.get('content-encoding', '').lower()
        if self._decoder is None and content_encoding in self.CONTENT_DECODERS:
            self._decoder = _get_decoder(content_encoding)

            # My added code below this comment
            return
        content_type = self.headers.get('content-type', '').lower()
        if self._decoder is None and content_type == 'application/x-gzip':
            self._decoder = _get_decoder('gzip')

But, is there a better way?

asked Sep 08 '15 17:09

user2367072

1 Answer

You misunderstood. Only transport-level compression is taken care of automatically, that is, compression that the HTTP server applies to the response body and announces with a Content-Encoding header.

You have compressed content. Since this wasn't applied just for the HTTP transport stage, requests won't remove it either.

requests communicates to the server that it accepts compressed responses by sending Accept-Encoding: gzip, deflate with every request sent. The server can then respond by compressing the whole response body and adding a Content-Encoding header indicating the compression used.

Your response has no Content-Encoding header, nor would applying compression again make sense here.
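To make the distinction concrete, here is a minimal sketch that classifies a response from its headers and the first bytes of its body. The function name `describe_compression` and the plain lower-case-keyed dict are illustrative assumptions, not part of requests; the only hard fact used is that every gzip stream starts with the bytes 1f 8b.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def describe_compression(headers, body_prefix):
    """Classify a response from its headers and the first bytes of its body.

    `headers` is assumed to be a mapping with lower-case keys (hypothetical
    input shape); `body_prefix` is the start of the body as received.
    """
    content_encoding = headers.get("content-encoding", "").lower()
    if content_encoding in ("gzip", "x-gzip", "deflate"):
        # Transport-level compression: requests/urllib3 undoes this itself.
        return "transport"
    if body_prefix.startswith(GZIP_MAGIC):
        # The payload itself is a gzip file; the caller has to decompress it.
        return "content"
    return "none"


# Headers like the ones in the question: Content-Type only, no Content-Encoding.
print(describe_compression({"content-type": "application/x-gzip"}, b"\x1f\x8b\x08\x00"))
```

With the headers shown in the question, this lands in the "content" branch, which is why the body comes back still compressed.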

Most of the time you want to download an already compressed archive like the DMOZ RDF dataset in the compressed form, anyway. You requested a compressed archive after all. It is not the job of the requests library to decode that.

In Python 3 you can handle decoding as a stream by using the gzip module and streaming the response:

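import gzip
import requests
import shutil

url = 'http://rdf.dmoz.org/rdf/content.rdf.u8.gz'  # URL from the question
path = 'content.rdf.u8'  # example output filename, pick your own

r = requests.get(url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True  # just in case transport encoding was applied
        gzip_file = gzip.GzipFile(fileobj=r.raw)
        shutil.copyfileobj(gzip_file, f)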

where you could use an RDF parser instead of copying the decompressed data to disk, of course.
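As a rough illustration of parsing instead of copying, the sketch below feeds the decompressed stream straight into the standard library's incremental XML parser. Note the assumptions: `xml.etree.ElementTree.iterparse` is a generic XML parser standing in for a real RDF parser, `count_elements` is a made-up helper name, and an in-memory `io.BytesIO` stands in for `r.raw` so the example runs without a network connection.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def count_elements(compressed_stream):
    """Count XML elements by tag while decompressing on the fly."""
    counts = {}
    with gzip.GzipFile(fileobj=compressed_stream) as xml_stream:
        for _event, element in ET.iterparse(xml_stream, events=("end",)):
            counts[element.tag] = counts.get(element.tag, 0) + 1
            element.clear()  # release the element's children and text once counted
    return counts


# Stand-in for r.raw: a small gzipped document held in memory.
sample = io.BytesIO(gzip.compress(b"<root><topic/><topic/><page/></root>"))
print(count_elements(sample))
```

With a live response you would pass `r.raw` (after setting `r.raw.decode_content = True`) in place of `sample`; nothing is written to disk and only a small window of the document is held in memory at a time.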

Unfortunately the Python 2 implementation of the gzip module requires a seekable file, which r.raw is not; you can either create your own streaming wrapper or add that _decoder attribute to the r.raw object above.
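One possible shape for such a streaming wrapper, offered as a sketch and not as the only way to do it: `zlib.decompressobj` never seeks, and passing `16 + zlib.MAX_WBITS` makes it expect a gzip header and trailer. The generator name `iter_gunzip` is made up for this example, it handles a single-member gzip stream only, and the demo at the bottom uses `gzip.compress`, which exists only on Python 3 (the generator itself uses only calls that Python 2 also has).

```python
import gzip
import zlib


def iter_gunzip(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed chunks."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Stand-in for r.iter_content(chunk_size=...): split a gzip payload into pieces.
original = b"<RDF>example</RDF>" * 1000
payload = gzip.compress(original)
pieces = [payload[i:i + 64] for i in range(0, len(payload), 64)]
restored = b"".join(iter_gunzip(pieces))
print(restored == original)
```

Against a real response this would be used as `iter_gunzip(r.iter_content(chunk_size=8192))` on a request made with `stream=True`.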

answered Oct 23 '22 16:10

Martijn Pieters
