Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get string objects instead of Unicode from JSON?

I'm using Python 2 to parse JSON from ASCII encoded text files.

When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can't change the libraries nor update them.

Is it possible to get string objects instead of Unicode ones?

Example

>>> import json >>> original_list = ['a', 'b'] >>> json_list = json.dumps(original_list) >>> json_list '["a", "b"]' >>> new_list = json.loads(json_list) >>> new_list [u'a', u'b']  # I want these to be of type `str`, not `unicode`

Update

This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution for today is to use a recent version of Python — i.e. Python 3 and forward.

like image 653
Brutus Avatar asked Jun 05 '09 16:06

Brutus


People also ask

How do I escape a Unicode character in JSON?

Escapes characters of a UTF-8 encoded Unicode string using JSON-style escape sequences. The escaping rules are as follows, in priority order: If the code point is the double quote (0x22), it is escaped as \" (backslash double quote). If the code point is the backslash (0x5C), it is escaped as \\ (double backslash).

Does JSON use Unicode?

Unicode Characters. When all the strings represented in a JSON text are composed entirely of Unicode characters [UNICODE] (however escaped), then that JSON text is interoperable in the sense that all software implementations which parse it will agree on the contents of names and of string values in objects and arrays.

Is Unicode the same as string?

Unicode is a standard encoding system that is used to represent characters from almost all languages. Every Unicode character is encoded using a unique integer code point between 0 and 0x10FFFF . A Unicode string is a sequence of zero or more code points.

Is JSON an Ascii?

Since any JSON can represent unicode characters in escaped sequence \uXXXX , JSON can always be encoded in ASCII.


1 Answers

While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str type strings instead of unicode type. Because JSON is a subset of YAML it works nicely:

>>> import json >>> import yaml >>> list_org = ['a', 'b'] >>> list_dump = json.dumps(list_org) >>> list_dump '["a", "b"]' >>> json.loads(list_dump) [u'a', u'b'] >>> yaml.safe_load(list_dump) ['a', 'b'] 

Notes

Some things to note though:

  • I get string objects because all my entries are ASCII encoded. If I would use unicode encoded entries, I would get them back as unicode objects — there is no conversion!

  • You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.

  • If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers) try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.

Conversion

As stated, there is no conversion! If you can't be sure to only deal with ASCII values (and you can't be sure most of the time), better use a conversion function:

I used the one from Mark Amery a couple of times now, it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.

like image 145
Brutus Avatar answered Sep 19 '22 18:09

Brutus