
Saving utf-8 texts with json.dumps as UTF8, not as \u escape sequence

Sample code:

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print(json_string)
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).

Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?

Berry Tsakala asked Aug 20 '13 14:08


People also ask

Can JSON handle UTF-8?

Yes. RFC 4627 allowed JSON to be encoded in UTF-8, UTF-16, or UTF-32, with UTF-8 as the default. The current standard, RFC 8259, requires UTF-8 for JSON text exchanged between systems that are not part of a closed ecosystem.

What is the return type of JSON dumps?

json.dumps() takes a Python object (dict, list, str, number, and so on) and returns a str containing its JSON representation.

What is JSON dumps () method?

json.dump() serializes a Python object and writes the JSON text to a file. json.dumps() returns the JSON text as a string instead, which is useful for printing, logging, or further processing. dump() takes an open file object (anything with a write() method) as its second argument, not a file name.
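As a minimal sketch of the difference, the snippet below serializes the same object both ways. The `data` dictionary is a made-up sample, and `io.StringIO` only stands in for an open text file.

```python
import io
import json

# Hypothetical sample data, not taken from the question.
data = {"name": "ברי צקלה"}

# dumps() returns the JSON text as a str.
text = json.dumps(data, ensure_ascii=False)

# dump() writes the same text to any object with a write() method;
# io.StringIO stands in for an open text file here.
buffer = io.StringIO()
json.dump(data, buffer, ensure_ascii=False)

print(text)
print(buffer.getvalue() == text)
```

Both calls accept the same keyword arguments, so ensure_ascii=False works identically for files and strings.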

Does UTF-8 use 8bits?

UTF-8 is a variable-width encoding built from 8-bit code units: each Unicode code point is encoded as one to four bytes. The first 128 Unicode characters are encoded as a single byte each, identical to their ASCII representation.
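The variable width is easy to check from Python. This is a small illustrative sketch; the sample characters are arbitrary picks, written as escapes so the exact code points are unambiguous.

```python
# Arbitrary sample characters covering the one- to four-byte cases.
samples = [
    "A",           # U+0041, ASCII letter
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u05d1",      # U+05D1, Hebrew letter bet
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

# Map each character to the number of bytes UTF-8 needs for it.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, width in widths.items():
    print(repr(ch), width, ch.encode("utf-8"))
```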


2 Answers

Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"

If you are writing to a file, just use json.dump() and leave it to the file object to encode:

with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)
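To confirm that the file really holds readable UTF-8 rather than \uXXXX escapes, and that nothing is lost on the way back, a round trip along these lines can help. It is only a sketch: the temporary path and the name `sample.json` are placeholders for whatever file you actually write.

```python
import json
import os
import tempfile

text = "ברי צקלה"

# A temporary path stands in for the real file name (placeholder).
path = os.path.join(tempfile.mkdtemp(), "sample.json")

with open(path, "w", encoding="utf8") as json_file:
    json.dump(text, json_file, ensure_ascii=False)

# Inspect the raw bytes on disk: they are UTF-8, with no \u escapes.
with open(path, "rb") as raw_file:
    raw = raw_file.read()
print(raw.decode("utf8"))

# Reading the file back yields the original string.
with open(path, encoding="utf8") as json_file:
    loaded = json.load(json_file)

# The default, escaped form decodes to the same value, so both
# representations are interchangeable for any JSON parser.
escaped_round_trip = json.loads(json.dumps(text))
print(loaded == text == escaped_round_trip)
```

Because the two forms parse to the same value, users who hand-edit the file can type characters directly or leave existing escapes in place.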

Caveats for Python 2

For Python 2, there are some more caveats to take into account. If you are writing to a file, use io.open() instead of open() to get a file object that encodes Unicode values for you as you write, then use json.dump() to write to that file:

import io

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)

Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:

import io

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))

In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:

>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}
>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה
Martijn Pieters answered Oct 16 '22 21:10


To write to a file

import codecs
import json

with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
    json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)

To print to stdout

import json

print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))
Hiep Tran answered Oct 16 '22 21:10