
Python JSON and Unicode

Update:

I found the answer here: Python UnicodeDecodeError - Am I misunderstanding encode?

I needed to explicitly decode my incoming file to Unicode when I read it, because it contained bytes that weren't valid ASCII. The encode was failing when it hit those bytes, since Python first tried to decode the byte string with the default ASCII codec.
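For example, a minimal sketch of that fix, assuming the incoming file is UTF-8 (the filename here is made up):

    import json
    import codecs

    # Decode the file's bytes to unicode objects up front, instead of
    # relying on the implicit ascii decode that json.dumps would
    # otherwise attempt on raw byte strings.
    with codecs.open("input.txt", "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f]

    print json.dumps(lines)  # non-ASCII comes out as \uXXXX escapes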

Original Question

So, I know there's something I'm just not getting here.

I have a list of unicode strings, some of which contain non-ASCII characters.

I want to encode that as JSON with

json.dumps(myList)

It throws an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 13: ordinal not in range(128)

How am I supposed to do this? I've tried setting the ensure_ascii parameter to both True and False, but neither fixes this problem.

I know I'm passing unicode strings to json.dumps. I understand that a JSON string is meant to be unicode. Why isn't it just sorting this out for me?

What am I doing wrong?

Update: Don Question sensibly suggests I provide a stack trace. Here it is:

Traceback (most recent call last):
  File "importFiles.py", line 69, in <module>
    x = u"%s" % conv
  File "importFiles.py", line 62, in __str__
    return self.page.__str__()
  File "importFiles.py", line 37, in __str__
    return json.dumps(self.page(),ensure_ascii=False)
  File "/usr/lib/python2.7/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 204, in encode
    return ''.join(chunks)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 17: ordinal not in range(128)

Note it's Python 2.7, and the error still occurs with ensure_ascii=False.

Update 2: Andrew Walker's useful link (in the comments) leads me to think I can coerce my data into a convenient byte format before trying to JSON-encode it, by doing something like:

data.encode("ascii","ignore")

Unfortunately that is throwing the same error.
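That makes sense given the update at the top: in Python 2, calling .encode() on a byte string first decodes it with the default ASCII codec, and that implicit decode is what actually fails. A minimal sketch reproducing it (the 0xb4 byte is taken from the traceback; latin-1 is only a guess at the file's real encoding):

    # Python 2.7: str.encode() on a *byte* string implicitly decodes it
    # first, using the default 'ascii' codec, so a non-ASCII byte raises
    # UnicodeDecodeError even though we asked for an encode.
    data = "caf\xb4"  # byte string containing the offending 0xb4 byte
    try:
        data.encode("ascii", "ignore")
    except UnicodeDecodeError as e:
        print e  # 'ascii' codec can't decode byte 0xb4 in position 3 ...

    # Decoding explicitly, with the file's real encoding, avoids the
    # implicit step (latin-1 is only a guess here):
    print repr(data.decode("latin-1"))  # u'caf\xb4'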

asked Mar 13 '12 by interstar



1 Answer

Try adding the argument ensure_ascii=False. Also, especially when asking Unicode-related questions, it's very helpful to add a longer (complete) traceback and to state which Python version you are using.

Citing the Python documentation for version 2.6.7:

"If ensure_ascii is False (default: True), then some chunks written to fp may be unicode instances, subject to normal Python str to unicode coercion rules. Unless fp.write() explicitly understands unicode (as in codecs.getwriter()) this is likely to cause an error."

So this proposal may cause new problems, but it fixed a similar problem I had: I fed the resulting unicode string into a StringIO object and wrote that to a file.

Because Python 2.7 has sys.getdefaultencoding() set to ascii, the implicit conversion in the ''.join(chunks) statement inside the json standard library will blow up if chunks contains byte strings that are not ASCII-encoded! You must ensure that any contained byte strings are decoded to unicode (or kept ASCII-compatible) beforehand; non-ASCII UTF-8 byte strings mixed into otherwise unicode output will trigger exactly this error.
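A sketch of that pattern, assuming Python 2.7 and UTF-8 output (the output file name is made up):

    # -*- coding: utf-8 -*-
    import json
    import codecs

    data = [u"plain ascii", u"acute accent: \u00b4"]

    # With ensure_ascii=False and unicode input, json.dumps returns a
    # unicode object, so whatever it is written to must understand
    # unicode. codecs.open wraps the file in a UTF-8 writer, much like
    # the codecs.getwriter() mentioned in the documentation above.
    with codecs.open("out.json", "w", encoding="utf-8") as fp:
        fp.write(json.dumps(data, ensure_ascii=False))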

answered Sep 28 '22 by Don Question