How to decode gzip-compressed data returned in an HTTP response in Python?

Question

I have built a client/server architecture in Python. My code takes an HTTP request from the client and serves it by requesting another HTTP server.

When I get the response from the third server, I am not able to decode the gzip-compressed data. I first split the response data using \r\n as the separator, which gave me the data as the last item in the list, and then I tried decompressing it with

zlib.decompress(data[-1])

but it gives me an "incorrect header" error. How should I approach this problem?

Code

client_reply = ''
while 1:
    chunk = server2.recv(512)
    if len(chunk):
        client.send(chunk)
        client_reply += chunk
    else:
        break
client_split = client_reply.split("\r\n")
print client_split[-1].decode('zlib')

I want to read the data that is being transferred between the client and the 2nd server.

jmunsch · Accepted Answer

Specify wbits when calling zlib.decompress(string, wbits, bufsize); see the end of "Troubleshooting" for an example.

Troubleshooting

Let's start with a curl command that downloads a byte-range response with an unknown "content-encoding" (note: we know beforehand that it is compressed somehow, maybe deflate, maybe gzip):

export URL="https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-18/segments/1461860106452.21/warc/CC-MAIN-20160428161506-00007-ip-10-239-7-51.ec2.internal.warc.gz"
curl -r 266472196-266527075 $URL | gzip -dc | tee hello.txt

With the following response headers:

HTTP/1.1 206 Partial Content
x-amz-id-2: IzdPq3DAPfitkgdXhEwzBSwkxwJRx9ICtfxnnruPCLSMvueRA8j7a05hKr++Na6s
x-amz-request-id: 14B89CED698E0954
Date: Sat, 06 Aug 2016 01:26:03 GMT
Last-Modified: Sat, 07 May 2016 08:39:18 GMT
ETag: "144a93586a13abf27cb9b82b10a87787"
Accept-Ranges: bytes
Content-Range: bytes 266472196-266527075/711047506
Content-Type: application/octet-stream
Content-Length: 54880
Server: AmazonS3

So to the point.

Let's display the hex output of the first 13 bytes: curl -r 266472196-266472208 $URL | xxd

hex output:

0000000: 1f8b 0800 0000 0000 0000 ecbd eb

The hex values show the basics of what we are working with.

Roughly, it is probably gzip ( 1f8b ) using the deflate method ( 08 ) with no flags set ( 00 ), no modification time ( 0000 0000 ), no extra flags ( 00 ), and an OS byte of 00, which RFC 1952 assigns to FAT filesystems.

Please refer to section 2.3 / 2.3.1: https://www.rfc-editor.org/rfc/rfc1952#section-2.3.1
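The same reading of the header can be sketched in code. This is a minimal sketch, assuming Python 3; parse_gzip_header is a hypothetical helper name, and the sample bytes are the 13 bytes copied from the hex dump above.

```python
import struct

# The 13 bytes shown in the xxd output above.
sample = bytes.fromhex("1f8b0800000000000000ecbdeb")

# Hypothetical helper: unpack the fixed 10-byte gzip member header
# described in RFC 1952, section 2.3.
def parse_gzip_header(data):
    if len(data) < 10:
        raise ValueError("need at least 10 bytes")
    # "<" = little-endian, no padding: two magic bytes, compression method,
    # flags, 4-byte modification time, extra flags, OS identifier.
    id1, id2, cm, flg, mtime, xfl, os_id = struct.unpack("<BBBBIBB", data[:10])
    return {
        "is_gzip": (id1, id2) == (0x1F, 0x8B),
        "method": cm,        # 8 means deflate
        "flags": flg,        # 0 means no optional header fields
        "mtime": mtime,      # 0 means no timestamp available
        "extra_flags": xfl,
        "os": os_id,         # 0 means FAT filesystem
    }

header = parse_gzip_header(sample)
print(header)
```

If is_gzip is true and method is 8, the gzip-aware wbits values are the ones worth trying first.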

So on to the Python:

>>> import requests
>>> url = 'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-18/segments/1461860106452.21/warc/CC-MAIN-20160428161506-00006-ip-10-239-7-51.ec2.internal.warc.gz'
>>> response = requests.get(url, headers={"Range": "bytes=257173173-257248267"})
>>> unknown_compressed_data = response.content

Notice anything similar?

>>> unknown_compressed_data[:10]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00'

On to the decompression. Let's just try a few wbits values based on the zlib documentation:

>>> import zlib

"zlib.error: Error -2 while preparing to decompress data: inconsistent stream state":

>>> zlib.decompress(unknown_compressed_data, -31)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -2 while preparing to decompress data: inconsistent stream state

"Error -3 while decompressing data: incorrect header check":

>>> zlib.decompress(unknown_compressed_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check

"zlib.error: Error -3 while decompressing data: invalid distance too far back":

>>> zlib.decompress(unknown_compressed_data, 30)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: invalid distance too far back

Possible solution:

>>> zlib.decompress(unknown_compressed_data, 31)
'WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2016-04-28T20:14:16Z\r\nWARC-Record-ID: <urn:uu
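The trial and error above comes down to what each wbits value tells zlib to expect. Here is a minimal sketch, assuming Python 3 and using locally compressed stand-in data rather than the real WARC record: 16 + zlib.MAX_WBITS (31) expects a gzip wrapper, 32 + zlib.MAX_WBITS (47) auto-detects a zlib or gzip header, and the default expects a zlib header.

```python
import gzip
import zlib

# Stand-in payload; this text is illustrative, not the real WARC record.
original = b"WARC/1.0\r\nWARC-Type: response\r\n" * 20
gzip_bytes = gzip.compress(original)

# 16 + MAX_WBITS (= 31): expect a gzip header and trailer.
as_gzip = zlib.decompress(gzip_bytes, 16 + zlib.MAX_WBITS)

# 32 + MAX_WBITS (= 47): auto-detect a zlib or gzip header.
auto = zlib.decompress(gzip_bytes, 32 + zlib.MAX_WBITS)

# Default wbits (15) expects a zlib header, so gzip data is rejected
# with an "incorrect header check" style error.
try:
    zlib.decompress(gzip_bytes)
    default_failed = False
except zlib.error:
    default_failed = True

print(as_gzip == original, auto == original, default_failed)
```

When the encoding is not known in advance, 32 + zlib.MAX_WBITS is a reasonable first attempt, since it accepts both zlib and gzip wrappers.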

Zbyněk Winkler · Answer

According to https://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html the headers and the body are separated by an empty line containing only CRLF characters. You could try

client_split = client_reply.split("\r\n\r\n",1)
print client_split[1].decode('zlib')

The split finds the empty line, and the additional parameter limits the number of splits, so the result is a list with two items: headers and body. But it is hard to recommend anything without knowing more about your code and the actual string being split.
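Putting the header/body split together with a gzip-aware wbits value might look like the sketch below. It assumes Python 3 (bytes rather than Python 2 strings), a complete response already read from the socket, and no chunked transfer encoding; decode_http_body is a hypothetical helper name, and the response is a fake one built locally.

```python
import gzip
import zlib

# Hypothetical helper: split a complete raw HTTP response and decode a gzip body.
def decode_http_body(raw):
    # Headers end at the first blank line (CRLF CRLF). partition() splits
    # only once, so CRLF bytes inside the compressed body are left alone.
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip().lower()
    if headers.get(b"content-encoding") in (b"gzip", b"x-gzip"):
        # 16 + MAX_WBITS tells zlib to expect the gzip wrapper.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    return body

# Build a fake response to exercise the helper.
payload = b"<html>hello</html>"
compressed = gzip.compress(payload)
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: " + str(len(compressed)).encode("ascii") + b"\r\n"
    b"\r\n"
) + compressed

print(decode_http_body(raw_response))
```

Splitting on every "\r\n" and taking the last item, as in the question, can cut the body wherever the compressed bytes happen to contain a CRLF, which is one reason to split only once at the blank line.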

Tags: python, http, python-2.x, zlib, sockets

vedarthk
