What is the right way to compress and decompress UTF-8 data using zlib?

Question

I have a very long JSON message that contains characters that go beyond the ASCII table. I convert it into a string as follows:

messStr = json.dumps(message,encoding='utf-8', ensure_ascii=False, sort_keys=True)

I need to store this string using a service that restricts its size to X bytes. I want to split the JSON string into pieces of length X and store them separately. I ran into some issues doing this (described here) so I want to compress the string slices to work around those issues. I tried to do this:

ss = messStr[start:fin]  # get piece of length X
ssc = zlib.compress(ss)  # compress it

When I do that, I get the following error from zlib.compress:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 225: ordinal not in range(128)

What is the right way to compress a UTF-8 string and what is then the right way to decompress it?

Lynx-Lab · Accepted Answer

A small addition to Martijn's response. I read on an Enthought blog about a nifty one-liner that spares you the need to import zlib in your own code.

Safely compressing a string (including your JSON dump) would look like this:

ssc = ss.encode('utf-8').encode('zlib_codec')

Decompressing back to utf-8 would be:

ss = ssc.decode('zlib_codec').decode('utf-8')
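The `str.encode('zlib_codec')` form above relies on Python 2 string methods; in Python 3, `bytes` objects have no `encode` method. A minimal sketch of the same round trip through the `codecs` module, which should behave the same on both versions, is shown below. The sample string is made up for illustration.

```python
import codecs
import zlib

# Hypothetical sample text containing a non-ASCII character.
ss = u'{"city": "Espa\xf1a"}'

# Encode the text to UTF-8 bytes first, then apply the zlib codec.
data = ss.encode('utf-8')
ssc = codecs.encode(data, 'zlib_codec')

# Reverse the steps: undo the zlib codec, then decode the UTF-8 bytes.
restored = codecs.decode(ssc, 'zlib_codec').decode('utf-8')
```

The codec output is ordinary zlib data, so `zlib.decompress(ssc)` should return the same bytes as `data`.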

Hope this helps.

Martijn Pieters · Answer

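Your JSON data is not UTF-8 encoded. The encoding parameter to the json.dumps() function tells it how to interpret Python byte strings in message (i.e. the input), not how to encode the resulting output. The output is not encoded at all because you used ensure_ascii=False, so the result can be a unicode object. zlib.compress() needs bytes, so Python tries to encode that object implicitly with the ASCII codec, which is what raises the UnicodeEncodeError.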

Encode the data before compression:

ssc = zlib.compress(ss.encode('utf8'))

When decompressing again, there is no need to decode from UTF-8; the json.loads() function assumes UTF-8 if the input is a bytestring.
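As a rough sketch of the full round trip under the question's size limit: the version below compresses the whole encoded string once and then splits the compressed bytes, rather than compressing each slice separately as in the question. That is a swapped-in approach, not something either answer prescribes. The names `pack`, `unpack`, and `limit`, and the sample message, are hypothetical. It decodes explicitly before `json.loads()` so that it does not depend on how a given Python version treats byte input.

```python
import json
import zlib


def pack(message, limit):
    """Serialize to JSON, encode as UTF-8, compress, split into byte chunks."""
    text = json.dumps(message, ensure_ascii=False, sort_keys=True)
    compressed = zlib.compress(text.encode('utf-8'))
    # Slicing happens on bytes, so every chunk is at most `limit` bytes long.
    return [compressed[i:i + limit] for i in range(0, len(compressed), limit)]


def unpack(chunks):
    """Rejoin the chunks in order, decompress, and parse the JSON."""
    raw = zlib.decompress(b''.join(chunks))
    return json.loads(raw.decode('utf-8'))


# Hypothetical message with a non-ASCII character.
message = {u'name': u'Espa\xf1a', u'values': [1, 2, 3]}
chunks = pack(message, 16)
restored = unpack(chunks)
```

Splitting after compression keeps the byte limit exact, whereas a slice of X characters can grow past X bytes once encoded as UTF-8. The trade-off is that all chunks must be retrieved and joined in order before anything can be decompressed.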

Tags: python, json, python-2.x, utf-8, compression

I Z
