What is the right way to compress and decompress UTF-8 data using zlib?

I have a very long JSON message that contains characters that go beyond the ASCII table. I convert it into a string as follows:

messStr = json.dumps(message,encoding='utf-8', ensure_ascii=False, sort_keys=True)

I need to store this string using a service that restricts its size to X bytes. I want to split the JSON string into pieces of length X and store them separately. I ran into some issues doing this (described here) so I want to compress the string slices to work around those issues. I tried to do this:

ss = messStr[start:fin]  # get piece of length X
ssc = zlib.compress(ss)  # compress it

When I do that, I get the following error from zlib.compress:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 225: ordinal not in range(128)

What is the right way to compress a UTF-8 string and what is then the right way to decompress it?

I Z asked Aug 26 '13 17:08


2 Answers

A small addition to Martijn's answer. I read on an Enthought blog about a nifty one-liner that spares you the need to import zlib in your own code.

Safely compressing a string (including your JSON dump) looks like this:

ssc = ss.encode('utf-8').encode('zlib_codec')

Decompressing and decoding back to the original string would be:

ss = ssc.decode('zlib_codec').decode('utf-8')
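On Python 3, bytes objects have no .encode() method, so the one-liners above only work on Python 2. The same codec can still be reached through the codecs module. A minimal sketch, assuming Python 3; the sample string is made up for illustration:

```python
import codecs

ss = "ma\xf1ana " * 50  # hypothetical sample text with a non-ASCII character

# Text -> UTF-8 bytes -> zlib-compressed bytes
ssc = codecs.encode(ss.encode("utf-8"), "zlib_codec")

# zlib-compressed bytes -> UTF-8 bytes -> text
restored = codecs.decode(ssc, "zlib_codec").decode("utf-8")
```

This still uses zlib under the hood; it only avoids the explicit import.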

Hope this helps.

Lynx-Lab answered Oct 12 '22 22:10


Your JSON data is not UTF-8 encoded. The encoding parameter to the json.dumps() function tells it how to interpret Python byte strings in message (i.e. the input), not how to encode the resulting output. It doesn't encode the output at all because you used ensure_ascii=False.
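The difference between text and encoded bytes is easy to check directly. A small sketch, assuming Python 3 (where json.dumps() no longer accepts an encoding argument) and a made-up message:

```python
import json

message = {"name": "Espa\xf1a", "id": 1}  # hypothetical sample payload

# With ensure_ascii=False the non-ASCII character is kept as-is,
# and the result is text (str), not encoded bytes.
mess_str = json.dumps(message, ensure_ascii=False, sort_keys=True)

# Encoding is a separate, explicit step that yields bytes.
mess_bytes = mess_str.encode("utf-8")
```

The byte string is one byte longer than the text is in characters here, because the single non-ASCII character takes two bytes in UTF-8.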

Encode the data before compression:

ssc = zlib.compress(ss.encode('utf8'))

When decompressing again, there is no need to decode from UTF-8; the json.loads() function assumes UTF-8 if the input is a bytestring.
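For the original goal of staying under a size limit, one possible approach is to encode and compress the whole string first and only then slice the compressed bytes, so that no slice can cut a multi-byte character in half. This is a sketch, assuming Python 3.6+ (where json.loads() accepts bytes). The helpers split_compressed and join_compressed, the sample data, and the limit of 64 bytes are hypothetical, not part of the question's service:

```python
import json
import zlib


def split_compressed(text, max_bytes):
    """Encode text as UTF-8, compress it, and cut the result into byte chunks."""
    compressed = zlib.compress(text.encode("utf-8"))
    return [compressed[i:i + max_bytes]
            for i in range(0, len(compressed), max_bytes)]


def join_compressed(chunks):
    """Reassemble the chunks in order and decompress back to UTF-8 bytes."""
    return zlib.decompress(b"".join(chunks))


message = {"items": ["ma\xf1ana %d" % i for i in range(200)]}  # made-up data
mess_str = json.dumps(message, ensure_ascii=False, sort_keys=True)

chunks = split_compressed(mess_str, 64)  # 64 stands in for the service's limit X
restored = json.loads(join_compressed(chunks))  # bytes are accepted directly
```

Note that with this layout a single chunk cannot be decompressed on its own; all chunks have to be fetched and joined in order first.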

Martijn Pieters answered Oct 12 '22 23:10