I have a very long JSON message that contains characters that go beyond the ASCII table. I convert it into a string as follows:
messStr = json.dumps(message,encoding='utf-8', ensure_ascii=False, sort_keys=True)
I need to store this string using a service that restricts its size to X bytes. I want to split the JSON string into pieces of length X and store them separately. I ran into some issues doing this (described here) so I want to compress the string slices to work around those issues. I tried to do this:
ss = messStr[start:fin]  # get a piece of length X
ssc = zlib.compress(ss) # compress it
When I do that, I get the following error from zlib.compress:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 225: ordinal not in range(128)
What is the right way to compress a UTF-8 string and what is then the right way to decompress it?
A small addition to Martijn's response: an Enthought blog post describes a nifty one-liner that spares you from importing zlib in your own code.
Safely compressing a string (including your JSON dump) looks like this:
ssc = ss.encode('utf-8').encode('zlib_codec')
Decompressing and decoding back to a unicode string:
ss = ssc.decode('zlib_codec').decode('utf-8')
Hope this helps.
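One caveat on the one-liner above: chaining .encode('zlib_codec') onto a string relies on Python 2 behaviour, and in Python 3 the str and bytes methods no longer accept bytes-to-bytes codecs. A minimal sketch of the same idea through the codecs module, which should work on both versions, might look like this (the sample string is made up for illustration):

```python
import codecs

# Hypothetical sample text containing a non-ASCII character (n with tilde)
ss = u'{"name": "Espa\xf1a"}'

# Text -> UTF-8 bytes -> zlib-compressed bytes
ssc = codecs.encode(ss.encode('utf-8'), 'zlib_codec')

# Compressed bytes -> UTF-8 bytes -> text
restored = codecs.decode(ssc, 'zlib_codec').decode('utf-8')
```

This still avoids importing zlib directly, although the codec uses zlib under the hood.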
Your JSON data is not UTF-8 encoded. The encoding parameter to the json.dumps() function tells it how to interpret Python byte strings in message (i.e. the input), not how to encode the resulting output. The output is not encoded at all because you used ensure_ascii=False, so you get a unicode string, and zlib.compress() then tries to encode it implicitly as ASCII.
Encode the data before compression:
ssc = zlib.compress(ss.encode('utf8'))
When decompressing again, there is no need to decode from UTF-8; the json.loads() function assumes UTF-8 if the input is a bytestring.
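To tie this back to the size limit in the question, here is a rough sketch of the full round trip. It is not the asker's exact approach: it compresses the whole encoded string first and then slices the compressed bytes, on the assumption that every piece must fit in X bytes and that the pieces can be fetched back in order. The sample message and the value of X are invented for illustration, and the encoding argument is left out because Python 3's json.dumps() does not accept it.

```python
import json
import zlib

# Hypothetical sample data and size limit, for illustration only
message = {u'city': u'Espa\xf1a', u'id': 1}
X = 16  # maximum number of bytes the storage service accepts per piece

# Serialize to text, then encode to UTF-8 bytes before compressing
mess_str = json.dumps(message, ensure_ascii=False, sort_keys=True)
compressed = zlib.compress(mess_str.encode('utf-8'))

# Slice the compressed bytes, so each piece is at most X bytes long
pieces = [compressed[i:i + X] for i in range(0, len(compressed), X)]

# Later: rejoin the pieces in order, decompress, and parse.
# The explicit decode is harmless and works on any Python version.
restored = json.loads(zlib.decompress(b''.join(pieces)).decode('utf-8'))
```

Slicing after compression keeps every piece within the byte limit, which slicing the unicode string by characters would not guarantee, since one character can take several bytes in UTF-8. The trade-off is that the pieces cannot be decompressed individually; they have to be joined first.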