
How to decode the gzip-compressed data returned in an HTTP response in Python?

I have created a client/server architecture in Python. I take an HTTP request from the client and serve it by requesting another HTTP server from my code.

When I get the response from the third server, I am not able to decode the gzip-compressed data. I first split the response data using \r\n as the separator, which gave me the data as the last item in the list. Then I tried decompressing it with

zlib.decompress(data[-1]) 

but it gives me an "incorrect header" error. How should I approach this problem?

Code

client_reply = ''
while 1:
    chunk = server2.recv(512)
    if len(chunk):
        client.send(chunk)
        client_reply += chunk
    else:
        break
client_split = client_reply.split("\r\n")
print client_split[-1].decode('zlib')

I want to read the data that is being transferred between the client and the 2nd server.

vedarthk asked Mar 18 '12 20:03


2 Answers

Specify wbits when calling zlib.decompress(string, wbits, bufsize); see the end of "Troubleshooting" for an example.

Troubleshooting

Let's start out with a curl command that downloads a byte-range response with an unknown "content-encoding" (note: we know beforehand that it is compressed somehow, maybe deflate, maybe gzip):

export URL="https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-18/segments/1461860106452.21/warc/CC-MAIN-20160428161506-00007-ip-10-239-7-51.ec2.internal.warc.gz"
curl -r 266472196-266527075 $URL | gzip -dc | tee hello.txt

With the following response headers:

HTTP/1.1 206 Partial Content
x-amz-id-2: IzdPq3DAPfitkgdXhEwzBSwkxwJRx9ICtfxnnruPCLSMvueRA8j7a05hKr++Na6s
x-amz-request-id: 14B89CED698E0954
Date: Sat, 06 Aug 2016 01:26:03 GMT
Last-Modified: Sat, 07 May 2016 08:39:18 GMT
ETag: "144a93586a13abf27cb9b82b10a87787"
Accept-Ranges: bytes
Content-Range: bytes 266472196-266527075/711047506
Content-Type: application/octet-stream
Content-Length: 54880
Server: AmazonS3

So to the point.

Let's display the hex output of the first 13 bytes: curl -r 266472196-266472208 $URL | xxd

hex output:

0000000: 1f8b 0800 0000 0000 0000 ecbd eb

The hex values show some basics of what we are working with.

Roughly, this is probably gzip (magic bytes 1f 8b) using deflate (compression method 08) with no flags set (00), no modification time (00 00 00 00), no extra flags (00), and an OS byte of 00, which stands for a FAT filesystem.

Please refer to section 2.3 / 2.3.1: https://www.rfc-editor.org/rfc/rfc1952#section-2.3.1
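As a rough illustration of that header layout, the ten fixed bytes can be unpacked with the standard struct module. This is only a sketch in Python 3: the function name parse_gzip_header is made up for this example, and the input is the first ten bytes copied from the hex dump above.

```python
import struct


def parse_gzip_header(data):
    """Unpack the 10-byte fixed gzip header described in RFC 1952, section 2.3.

    Hypothetical helper for illustration; it ignores the optional fields
    (extra, name, comment, header CRC) that may follow when FLG is non-zero.
    """
    # Layout: ID1, ID2, CM, FLG (1 byte each), MTIME (4 bytes, little-endian),
    # XFL, OS (1 byte each).
    id1, id2, cm, flg, mtime, xfl, os_byte = struct.unpack("<BBBBIBB", data[:10])
    return {
        "is_gzip": (id1, id2) == (0x1F, 0x8B),
        "method": cm,      # 8 means deflate
        "flags": flg,
        "mtime": mtime,    # 0 means no timestamp available
        "extra_flags": xfl,
        "os": os_byte,     # 0 means FAT filesystem
    }


# First ten bytes taken from the xxd output above.
header = bytes.fromhex("1f8b0800000000000000")
info = parse_gzip_header(header)
print(info)
```

If is_gzip comes back False, the payload is more likely a zlib or raw deflate stream, and a different wbits value is needed.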

So, on to the Python:

>>> import requests
>>> url = 'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-18/segments/1461860106452.21/warc/CC-MAIN-20160428161506-00006-ip-10-239-7-51.ec2.internal.warc.gz'
>>> response = requests.get(url, headers={"Range": "bytes=257173173-257248267"})
>>> unknown_compressed_data = response.content

Notice anything similar?

>>> unknown_compressed_data[:10]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00'

On to the decompression. Let's just try wbits values at random, based on the zlib documentation:

>>> import zlib

"zlib.error: Error -2 while preparing to decompress data: inconsistent stream state":

>>> zlib.decompress(unknown_compressed_data, -31)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -2 while preparing to decompress data: inconsistent stream state

"Error -3 while decompressing data: incorrect header check":

>>> zlib.decompress(unknown_compressed_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check

"zlib.error: Error -3 while decompressing data: invalid distance too far back":

>>> zlib.decompress(unknown_compressed_data, 30)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: invalid distance too far back

Possible solution:

>>> zlib.decompress(unknown_compressed_data, 31)
'WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2016-04-28T20:14:16Z\r\nWARC-Record-ID: <urn:uu
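The reason 31 works is that wbits also selects the container format: per the Python zlib documentation, 8 to 15 expects a zlib header, -8 to -15 expects a raw deflate stream, 16 + (8 to 15) expects a gzip header, and 32 + (8 to 15) auto-detects zlib or gzip. So 31 is 16 + 15. As far as I can tell, -31 is simply out of range, and 30 (16 + 14) asks for a window smaller than the one the data was compressed with. A minimal Python 3 sketch, using a made-up payload instead of the Common Crawl download:

```python
import gzip
import zlib

# Made-up payload for illustration; not the real WARC record.
payload = b"WARC/1.0\r\nWARC-Type: response\r\n" * 100

gzip_data = gzip.compress(payload)   # gzip container (RFC 1952)
zlib_data = zlib.compress(payload)   # zlib container (RFC 1950)
raw = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_data = raw.compress(payload) + raw.flush()   # raw deflate, no container

results = {
    "gzip": zlib.decompress(gzip_data, 16 + zlib.MAX_WBITS),       # wbits 31
    "zlib": zlib.decompress(zlib_data, zlib.MAX_WBITS),            # wbits 15
    "raw": zlib.decompress(raw_data, -zlib.MAX_WBITS),             # wbits -15
    "auto_gzip": zlib.decompress(gzip_data, 32 + zlib.MAX_WBITS),  # wbits 47
    "auto_zlib": zlib.decompress(zlib_data, 32 + zlib.MAX_WBITS),  # wbits 47
}

# The default wbits expects a zlib header, so gzip data is rejected.
try:
    zlib.decompress(gzip_data)
    default_error = None
except zlib.error as exc:
    default_error = str(exc)

print(default_error)
```

When the format is not known in advance, 32 + zlib.MAX_WBITS is a convenient choice because it accepts both gzip and zlib streams, though not raw deflate.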
jmunsch answered Nov 11 '22 01:11


According to https://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html the headers and the body are separated by an empty line containing only CRLF characters. You could try

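import zlib
client_split = client_reply.split("\r\n\r\n", 1)
# 16 + zlib.MAX_WBITS makes zlib expect a gzip header instead of a zlib one
print zlib.decompress(client_split[1], 16 + zlib.MAX_WBITS)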

The split finds the empty line, and the additional parameter limits the number of splits, so the result is a list with two items: the headers and the body. But it is hard to recommend anything without knowing more about your code and the actual string being split.
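Putting the two ideas together, here is a minimal Python 3 sketch that splits a raw response at the first blank line and decompresses the body according to Content-Encoding. The helper names split_http_response and decode_body are made up for this example, the response is built locally rather than read from a socket, and it assumes the body is not sent with Transfer-Encoding: chunked (which would need to be de-chunked first).

```python
import gzip
import zlib


def split_http_response(raw):
    """Split raw response bytes into (status line, headers dict, body bytes)."""
    # Only the FIRST blank line separates headers from body; compressed
    # bodies may themselves contain b"\r\n", so never split on every CRLF.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.split(b"\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("latin-1")] = value.strip().decode("latin-1")
    return status_line.decode("latin-1"), headers, body


def decode_body(headers, body):
    """Decompress the body based on the Content-Encoding header, if any."""
    encoding = headers.get("content-encoding", "").lower()
    if encoding in ("gzip", "x-gzip"):
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        return zlib.decompress(body)
    return body


# Locally built response standing in for what server2.recv() would return.
compressed = gzip.compress(b"<html>hello</html>")
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: " + str(len(compressed)).encode("ascii") + b"\r\n"
    b"\r\n" + compressed
)

status, headers, body = split_http_response(raw_response)
decoded = decode_body(headers, body)
print(status, decoded)
```

In the proxy loop from the question, the chunks would still be forwarded to the client unchanged; only the accumulated copy would go through these helpers for inspection.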

Zbyněk Winkler answered Nov 11 '22 02:11