I'm using Python 2 to parse JSON from ASCII encoded text files.
When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can neither change the libraries nor update them.
Is it possible to get string objects instead of Unicode ones?
    >>> import json
    >>> original_list = ['a', 'b']
    >>> json_list = json.dumps(original_list)
    >>> json_list
    '["a", "b"]'
    >>> new_list = json.loads(json_list)
    >>> new_list
    [u'a', u'b']  # I want these to be of type `str`, not `unicode`
This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution today is to use a recent version of Python, i.e. Python 3 or later.
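On Python 3 the problem does not arise, because str is already the Unicode text type and json.loads returns it directly. A quick check, assuming a Python 3 interpreter (the variable names simply mirror the session above):

```python
import json

original_list = ['a', 'b']
new_list = json.loads(json.dumps(original_list))

# On Python 3 every decoded string is a plain `str`; there is no separate
# `unicode` type any more.
print([type(item).__name__ for item in new_list])  # ['str', 'str']
```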
While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str type strings instead of unicode type. Because JSON is a subset of YAML, it works nicely:
    >>> import json
    >>> import yaml
    >>> list_org = ['a', 'b']
    >>> list_dump = json.dumps(list_org)
    >>> list_dump
    '["a", "b"]'
    >>> json.loads(list_dump)
    [u'a', u'b']
    >>> yaml.safe_load(list_dump)
    ['a', 'b']
Some things to note though:
I get string objects because all my entries are ASCII encoded. If my entries contained non-ASCII characters, I would get them back as unicode objects; there is no conversion!
You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.
If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers), try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.
As stated, there is no conversion! If you can't be sure to only deal with ASCII values (and you can't be sure most of the time), better use a conversion function:
I have used the one from Mark Amery a couple of times now; it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.
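To make the two approaches concrete, here is a minimal sketch of what such a conversion function and its object_hook variant might look like. It is not the exact code from either answer mentioned above. The names byteify, _byteify_hook, loads_byteified and text_type are illustrative, and UTF-8 is an assumed target encoding. The version check is only there so the sketch also runs on Python 3, where the result is bytes rather than Python 2's str.

```python
import json
import sys

# Python 2 calls its text type `unicode`; Python 3 calls it `str`.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def byteify(data):
    """Return a copy of `data` with every text string encoded as UTF-8."""
    if isinstance(data, text_type):
        return data.encode('utf-8')
    if isinstance(data, list):
        return [byteify(item) for item in data]
    if isinstance(data, dict):
        return {byteify(key): byteify(value) for key, value in data.items()}
    # Numbers, booleans and None pass through unchanged.
    return data


def _byteify_hook(data, ignore_dicts=False):
    """Like byteify, but skips dicts the decoder has already converted."""
    if isinstance(data, text_type):
        return data.encode('utf-8')
    if isinstance(data, list):
        return [_byteify_hook(item, ignore_dicts=True) for item in data]
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify_hook(key, ignore_dicts=True):
                _byteify_hook(value, ignore_dicts=True)
            for key, value in data.items()
        }
    return data


def loads_byteified(json_text):
    """Parse JSON, converting strings while the decoder builds each object."""
    # object_hook only fires for JSON objects, so a top-level string or list
    # still needs one final pass (dicts are skipped: the hook handled them).
    return _byteify_hook(
        json.loads(json_text, object_hook=_byteify_hook),
        ignore_dicts=True,
    )
```

On Python 2, byteify(json.loads('["a", "b"]')) should give back ['a', 'b'] as str objects. The hook variant avoids walking the finished structure a second time, which is where the possible speed-up on big files would come from.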