I am trying to read a gzipped XML file that I fetch with requests. Everything I have read indicates that the decompression should happen automatically.
#!/usr/bin/python
from __future__ import unicode_literals
import requests

if __name__ == '__main__':
    url = 'http://rdf.dmoz.org/rdf/content.rdf.u8.gz'
    headers = {
        'Accept-Encoding': "gzip,x-gzip,deflate,sdch,compress",
        'Accept-Content': 'gzip',
        'HTTP-Connection': 'keep-alive',
        'Accept-Language': "en-US,en;q=0.8",
    }
    request_reply = requests.get(url, headers=headers)
    print request_reply.headers
    request_reply.encoding = 'utf-8'
    print request_reply.text[:200]
    print request_reply.content[:200]
The header in my first line of output looks like this:
{'content-length': '260071268', 'accept-ranges': 'bytes', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Tue, 08 Sep 2015 16:27:49 GMT', 'content-type': 'application/x-gzip'}
The next two output lines appear to be binary, where I was expecting XML text:
�Iɒ(�����~ؗool���u�rʹ�J���io� a2R1��ߞ|�<����_��������Ҽҿ=�Z����onnz7�{JO���}h�����6��·��>,aҚ>��hZ6�u��x���?y�_�.y�$�Բ
�Iɒ(�����~ؗool���u�rʹ�J���io� a2R1��ߞ|�<����_��������Ҽҿ=�Z����onnz7�{JO��}h�����6��·��>,aҚ>��hZ6�u��x���
I think part of the problem is that site-packages/requests/packages/urllib3/response.py does not recognize gzip unless the response headers include 'content-encoding': 'gzip'.
I was able to get the results I wanted by adding 4 lines to a method in response.py, like so:
def _init_decoder(self):
    """
    Set-up the _decoder attribute if necessary.
    """
    # Note: content-encoding value should be case-insensitive, per RFC 7230
    # Section 3.2
    content_encoding = self.headers.get('content-encoding', '').lower()
    if self._decoder is None and content_encoding in self.CONTENT_DECODERS:
        self._decoder = _get_decoder(content_encoding)
        # My added code below this comment
        return
    content_type = self.headers.get('content-type', '').lower()
    if self._decoder is None and content_type == 'application/x-gzip':
        self._decoder = _get_decoder('gzip')
But, is there a better way?
The gzip module provides the GzipFile class, as well as the open(), compress() and decompress() convenience functions. The GzipFile class reads and writes gzip-format files, automatically compressing or decompressing the data so that it looks like an ordinary file object.
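As a rough illustration of the functions named above, here is a minimal Python 3 sketch. The sample payload is made up for the example and is not taken from the DMOZ file.

```python
import gzip
import io

payload = b"<rdf>example</rdf>"  # made-up sample data

# The one-shot helpers operate on whole byte strings held in memory.
compressed = gzip.compress(payload)
restored = gzip.decompress(compressed)

# GzipFile wraps any file-like object and can be read like an ordinary file.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
    streamed = gz.read()
```

Both `restored` and `streamed` should hold the original bytes; the second form is the one that matters when the data is too large to keep in memory at once.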
Gzip is a file format and a software application used for file compression and decompression, originally on Unix and Unix-like systems. Web servers also commonly use the format to compress HTTP content before it is served to a client.
"gzip" is often also used to refer to the gzip file format, which begins with a 10-byte header containing a magic number (1f 8b), the compression method (08 for DEFLATE), 1 byte of header flags, a 4-byte timestamp, compression flags, and the operating system ID.
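The header layout described above can be checked by hand. The following is a small sketch, assuming the fields appear in the order listed and that the timestamp is little-endian; `GzipHeader` and `parse_gzip_header` are names invented for this example.

```python
import struct
from collections import namedtuple

# Hypothetical container for the six fields of the fixed header.
GzipHeader = namedtuple("GzipHeader", "magic method flags mtime extra_flags os_id")


def parse_gzip_header(data):
    """Parse the fixed 10-byte gzip header at the start of `data`."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes")
    # Two magic bytes, method, flags, 4-byte little-endian timestamp,
    # compression flags, operating system ID: 10 bytes in total.
    id1, id2, method, flags, mtime, xfl, os_id = struct.unpack("<BBBBIBB", data[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    return GzipHeader(bytes([id1, id2]), method, flags, mtime, xfl, os_id)
```

Running this on the first bytes of a response body is a quick way to confirm that the "binary" output is a gzip archive rather than corrupted text.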
You misunderstood. Only transport-level compression is taken care of automatically, that is, compression applied by the HTTP server for the transfer itself.
You have compressed content. Since this compression wasn't applied just for the HTTP transport stage, requests won't remove it either.
requests communicates to the server that it accepts compressed responses by sending Accept-Encoding: gzip, deflate with every request sent. The server can then respond by compressing the whole response body and adding a Content-Encoding header indicating the compression used.
Your response has no Content-Encoding header, nor would applying compression again make sense here.
Most of the time you want to download an already compressed archive, like the DMOZ RDF dataset, in its compressed form anyway. You requested a compressed archive, after all. It is not the job of the requests library to decode that.
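For small downloads, the distinction can be handled in application code after the fact. This is only a sketch: `gunzip_if_archive` is a hypothetical helper, it assumes the header lookup works with a lower-case key (true for the case-insensitive headers object in requests), and it ignores any parameters in the Content-Type value. It loads everything into memory, so it is not suitable for a 260 MB file.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def gunzip_if_archive(body, headers):
    """Return decompressed bytes when the body itself is a gzip archive."""
    content_type = headers.get("content-type", "").lower()
    looks_gzipped = body[:2] == GZIP_MAGIC
    # Content-Type describes the payload; Content-Encoding (handled by
    # requests) describes compression applied only for the transfer.
    if content_type in ("application/x-gzip", "application/gzip") and looks_gzipped:
        return gzip.decompress(body)
    return body
```

With a response object `r`, a call might look like `gunzip_if_archive(r.content, r.headers)`.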
In Python 3 you can handle decoding as a stream by using the gzip module and streaming the response:
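import gzip
import shutil

import requests

# Example values, not part of the original answer: the URL from the question
# and an arbitrary output path.
url = 'http://rdf.dmoz.org/rdf/content.rdf.u8.gz'
path = 'content.rdf.u8'

r = requests.get(url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True  # just in case transport encoding was applied
        gzip_file = gzip.GzipFile(fileobj=r.raw)
        shutil.copyfileobj(gzip_file, f)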
where you could use an RDF parser instead of copying the decompressed data to disk, of course.
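To illustrate feeding the decompressed stream to a parser instead of a file, here is a sketch that uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for a real RDF parser. The helper name `count_elements` and the sample XML are assumptions made for the example.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def count_elements(compressed_stream):
    """Count XML elements by tag name while decompressing incrementally."""
    counts = {}
    with gzip.GzipFile(fileobj=compressed_stream) as gz:
        for _event, elem in ET.iterparse(gz, events=("end",)):
            counts[elem.tag] = counts.get(elem.tag, 0) + 1
            elem.clear()  # drop this element's children and text once counted
    return counts


# Made-up sample standing in for the downloaded archive.
sample = gzip.compress(b"<RDF><Topic/><Topic/><ExternalPage/></RDF>")
print(count_elements(io.BytesIO(sample)))
```

In Python 3, the `r.raw` object of a streamed response could be passed in place of the `io.BytesIO` wrapper.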
Unfortunately, the Python 2 implementation of the gzip module requires a seekable file. On Python 2 you can either create your own streaming wrapper or set that _decoder attribute on the r.raw object above.
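One possible shape for such a streaming wrapper, sketched with `zlib` rather than the `gzip` module, is shown below. This is a swapped-in technique and not something the answer spells out; `iter_gunzip` is a hypothetical name. It needs no seekable file and should behave the same on Python 2 and Python 3.

```python
import zlib


def iter_gunzip(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed chunks."""
    # Passing 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit whatever is still buffered once the input is exhausted.
    tail = decompressor.flush()
    if tail:
        yield tail
```

With a response fetched using `stream=True`, the chunks could come from `r.iter_content(chunk_size=8192)`, and each yielded piece can be written to a file or passed to a parser.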