I found the answer here: Python UnicodeDecodeError - Am I misunderstanding encode?
I needed to explicitly decode my incoming file into Unicode when I read it, because it contained bytes that were neither plain ASCII nor already decoded to Unicode, so the encoding step failed when it hit those characters.
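A minimal sketch of that fix, assuming the incoming file is called import.txt and is Latin-1 encoded (the 0xb4 byte is an acute accent in Latin-1, so that encoding is a guess; substitute whatever your file actually uses):

import io
import json

# Decode the file to unicode objects the moment it is read, instead of
# passing raw byte strings on to json.dumps later.
with io.open("import.txt", "r", encoding="latin-1") as f:
    lines = f.read().splitlines()   # every element is now a unicode object

# Serialising a list of unicode strings no longer trips the ascii codec.
print json.dumps(lines, ensure_ascii=False).encode("utf-8")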
So, I know there's something I'm just not getting here.
I have an array of unicode strings, some of which contain non-ASCII characters.
I want to encode that as JSON with:
json.dumps(myList)
It throws an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 13: ordinal not in range(128)
How am I supposed to do this? I've tried setting the ensure_ascii parameter to both True and False, but neither fixes this problem.
I know I'm passing unicode strings to json.dumps. I understand that a json string is meant to be unicode. Why isn't it just sorting this out for me?
What am I doing wrong?
Update: Don Question sensibly suggests I provide a stack trace. Here it is:
Traceback (most recent call last):
File "importFiles.py", line 69, in <module>
x = u"%s" % conv
File "importFiles.py", line 62, in __str__
return self.page.__str__()
File "importFiles.py", line 37, in __str__
return json.dumps(self.page(),ensure_ascii=False)
File "/usr/lib/python2.7/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/usr/lib/python2.7/json/encoder.py", line 204, in encode
return ''.join(chunks)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 17: ordinal not in range(128)
Note it's Python 2.7, and the error still occurs with ensure_ascii=False.
Update 2: Andrew Walker's useful link (in the comments) leads me to think I can coerce my data into a convenient byte format before trying to JSON-encode it by doing something like:
data.encode("ascii","ignore")
Unfortunately that is throwing the same error.
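That fails for the same underlying reason: data is a byte string, and calling .encode() on a byte string makes Python 2 first decode it implicitly with the ascii codec (the "ignore" argument only applies to the encode step, not to that hidden decode). A small sketch of the decode-first alternative, with the source encoding again assumed to be Latin-1:

import json

data = "caf\xb4"                  # byte string containing the offending 0xb4

# data.encode("ascii", "ignore") raises UnicodeDecodeError, because Python 2
# effectively runs data.decode("ascii") before it can encode anything.

text = data.decode("latin-1")     # decode explicitly to a unicode object first
print json.dumps([text], ensure_ascii=False).encode("utf-8")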
Encode Unicode data as UTF-8. RFC 7159 requires that JSON be represented using UTF-8, UTF-16, or UTF-32, with UTF-8 being the recommended default for maximum interoperability. Python's built-in json module provides the json.dump() and json.dumps() methods to encode Python objects into JSON.
Both json.dump() and json.dumps() have an ensure_ascii parameter. ensure_ascii is True by default, so the output is guaranteed to have all incoming non-ASCII characters escaped.
The RFC does not explicitly forbid JSON strings which contain byte sequences that don't correspond to valid Unicode characters (e.g. unpaired UTF-16 surrogates), but it does note that they may cause interoperability problems. By default, the json module accepts and outputs (when present in the original str) code points for such sequences.
Python 3, by contrast, is all-in on Unicode and UTF-8 specifically. Python 3 source code is assumed to be UTF-8 by default, which means you don't need # -*- coding: UTF-8 -*- at the top of .py files. All text (str) is Unicode by default, and encoded Unicode text is represented as binary data (bytes).
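Back in Python 2.7, which the question uses, ensure_ascii behaves like this once the input really is unicode (the sample list below is made up):

# -*- coding: utf-8 -*-
import json

data = [u"café"]

# Default (ensure_ascii=True): non-ASCII characters are escaped and the
# result is a plain ASCII byte string.
print json.dumps(data)                                        # ["caf\u00e9"]

# ensure_ascii=False: the result may be a unicode object, so encode it
# explicitly as UTF-8 before writing it anywhere.
print json.dumps(data, ensure_ascii=False).encode("utf-8")    # ["café"]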
Try adding the argument ensure_ascii=False.
Also, especially when asking about unicode-related issues, it is very helpful to add a longer (complete) traceback and to state which Python version you are using.
Citing the Python documentation for version 2.6.7:
"If ensure_ascii is False (default: True), then some chunks written to fp may be unicode instances, subject to normal Python str to unicode coercion rules. Unless fp.write() explicitly understands unicode (as in codecs.getwriter()) this is likely to cause an error."
So this proposal may cause new problems, but it fixed a similar problem I had: I fed the resulting unicode string into a StringIO object and wrote that to a file.
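A minimal sketch of that approach, using codecs.getwriter() as the quoted documentation suggests (the file name out.json and the sample data are made up):

# -*- coding: utf-8 -*-
import codecs
import json

data = [u"café", u"naïve"]

# Wrap the raw byte stream so that fp.write() explicitly understands unicode;
# json.dump may hand it unicode chunks when ensure_ascii=False.
with open("out.json", "wb") as raw:
    writer = codecs.getwriter("utf-8")(raw)
    json.dump(data, writer, ensure_ascii=False)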
Because this is Python 2.7 and sys.getdefaultencoding() is set to ascii, the implicit conversion performed by the ''.join(chunks) statement in the json standard library blows up when chunks mixes unicode objects with byte strings that are not ASCII-clean. You must ensure beforehand that any contained strings are either proper unicode objects or ASCII-safe byte strings; byte strings holding non-ASCII bytes (such as the 0xb4 here) are what trigger the implicit ascii decode.
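A short reproduction of that blow-up and one way out of it, assuming the stray 0xb4 byte is Latin-1 encoded:

import json

# Mixing a unicode object with a byte string that holds a non-ASCII byte
# reproduces the error from the traceback above.
bad = [u"fine", "caf\xb4"]
try:
    json.dumps(bad, ensure_ascii=False)
except UnicodeDecodeError as e:
    print "''.join(chunks) blew up:", e

# Decoding every byte string to unicode first makes the same call work.
good = [s.decode("latin-1") if isinstance(s, str) else s for s in bad]
print json.dumps(good, ensure_ascii=False).encode("utf-8")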