
How can I convert a dict to a unicode JSON string?

Tags: python, json

As far as I can tell, this isn't possible with the standard library json module. json.dumps automatically escapes all non-ASCII characters and then encodes the string to ASCII. I can tell it not to escape non-ASCII characters, but then it crashes when it tries to convert the output to ASCII.

The problem is that I don't want ASCII! I just want my JSON string back as a unicode (or UTF-8) string. Is there a convenient way to do that?

Here's an example to demonstrate what I want:

d = {'navn': 'Åge', 'stilling': 'Lærling'}
json.dumps(d, output_encoding='utf8')
# => '{"stilling": "Lærling", "navn": "Åge"}'

But of course, there is no such option as output_encoding, so here's the actual output:

d = {'navn': 'Åge', 'stilling': 'Lærling'}
json.dumps(d)
# => '{"stilling": "L\\u00e6rling", "navn": "\\u00c5ge"}'

So to summarize: I want to convert a Python dict to a UTF-8 JSON string without any escapes. How can I do that?


I'll accept solutions like:

  • Hacks (pre- and post-processing the input to dumps to achieve the desired effect)
  • Subclassing JSONEncoder (I have no idea how it works and the documentation isn't very helpful)
  • Third-party libraries available on PyPI
Hubro asked Jul 28 '12 08:07




2 Answers

Requirements

  • Make sure your Python files are encoded in UTF-8. Otherwise your non-ASCII characters may turn into question marks (?). Notepad++ has excellent encoding options for this.

  • Make sure that you have the appropriate fonts installed. If you want to display Japanese characters, you need to install Japanese fonts.

  • Make sure that your IDE or console supports displaying Unicode characters. Otherwise you might get a UnicodeEncodeError, for example:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 22-23: character maps to <undefined>

PyScripter works for me. It's included with "Portable Python" at http://portablepython.com/wiki/PortablePython3.2.1.1

  • Make sure you're using Python 3+, since this version offers better Unicode support.

Problem

By default, json.dumps() escapes non-ASCII characters as \uXXXX sequences.

Solution

Read the update at the bottom. Or...

Replace each escaped character with the Unicode character it represents.

I created a simple lambda function called getStringWithDecodedUnicode that does just that.

import re
# Replace every \uXXXX escape with the character it encodes.
getStringWithDecodedUnicode = lambda s: re.sub(r'\\u([\da-f]{4})', lambda m: chr(int(m.group(1), 16)), s)

Here's getStringWithDecodedUnicode as a regular function.

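def getStringWithDecodedUnicode(value):
    # Matches a JSON \uXXXX escape and captures its four hex digits.
    findUnicodeRE = re.compile(r'\\u([\da-f]{4})')

    def getParsedUnicode(match):
        # Convert the captured hex digits to the corresponding character.
        return chr(int(match.group(1), 16))

    return findUnicodeRE.sub(getParsedUnicode, str(value))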

Example

testJSONWithUnicode.py (Using PyScripter as the IDE)

import re
import json
getStringWithDecodedUnicode = lambda s: re.sub(r'\\u([\da-f]{4})', lambda m: chr(int(m.group(1), 16)), s)

data = {"Japan":"日本"}
jsonString = json.dumps( data )
print( "json.dumps({0}) = {1}".format( data, jsonString ) )
jsonString = getStringWithDecodedUnicode( jsonString )
print( "Decoded Unicode: %s" % jsonString )

Output

json.dumps({'Japan': '日本'}) = {"Japan": "\u65e5\u672c"}
Decoded Unicode: {"Japan": "日本"}
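
One caveat with the regex approach, sketched below under a few assumptions: the pattern cannot tell a real \uXXXX escape from a literal backslash followed by "u0041" in the data, so it can turn valid JSON into invalid JSON. The helper name decode_escapes and the sample key "path" are made up for this illustration; the pattern is the same one used above.

```python
import json
import re

# Same pattern as in the answer above, written as a raw string.
UNICODE_ESCAPE = re.compile(r'\\u([\da-f]{4})')

def decode_escapes(text):
    """Hypothetical helper mirroring getStringWithDecodedUnicode."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# The value is the six literal characters \u0041, not the letter A.
data = {"path": "\\u0041"}
escaped = json.dumps(data)           # the backslash becomes \\ in the JSON text
rewritten = decode_escapes(escaped)  # the regex still matches the trailing \u0041

try:
    json.loads(rewritten)
    still_valid = True
except ValueError:                   # json.JSONDecodeError subclasses ValueError
    still_valid = False

# ensure_ascii=False needs no post-processing, so the round trip holds.
safe = json.dumps(data, ensure_ascii=False)
round_trips = json.loads(safe) == data
```

Here still_valid ends up False while round_trips is True, which is one more reason to prefer the ensure_ascii=False option from the update.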

Update

Or... just pass ensure_ascii=False as an option for json.dumps.

Note: You need to meet the requirements that I outlined at the beginning or else this isn't going to work.

import json
data = {'navn': 'Åge', 'stilling': 'Lærling'}
result = json.dumps(data, ensure_ascii=False)
print(result)  # prints {"navn": "Åge", "stilling": "Lærling"} (key order may differ before Python 3.7)
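
To make the difference concrete, here is a minimal side-by-side sketch, assuming Python 3.7 or later (for str.isascii and insertion-ordered dicts). The variable names escaped and readable are only for illustration.

```python
import json

data = {'navn': 'Åge', 'stilling': 'Lærling'}

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX.
escaped = json.dumps(data)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(data, ensure_ascii=False)

# The escaped form is pure ASCII, the readable form is not.
escaped_is_ascii = escaped.isascii()
readable_is_ascii = readable.isascii()

# Both strings decode to the same dict, so nothing is lost either way.
same_content = json.loads(escaped) == json.loads(readable) == data
```

Both results are ordinary str objects in Python 3; whether the readable one prints correctly still depends on the console encoding mentioned in the requirements.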
Larry Battle answered Oct 14 '22 11:10


ensure_ascii=False is the best solution IMHO.

If you are using Python 2.7, here is an example Python file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# example.py
from __future__ import unicode_literals
from json import dumps as json_dumps
d = {'navn': 'Åge', 'stilling': 'Lærling'}
print json_dumps(d, ensure_ascii=False).encode('utf-8')
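
For comparison, a rough Python 3 counterpart of the script above might look like the sketch below. It assumes you want UTF-8 bytes or a UTF-8 file rather than console output; the temporary directory and the file name example.json are placeholders picked for this illustration.

```python
import json
import os
import tempfile

d = {'navn': 'Åge', 'stilling': 'Lærling'}

# In Python 3, dumps returns str; encode it explicitly to get UTF-8 bytes.
text = json.dumps(d, ensure_ascii=False)
utf8_bytes = text.encode('utf-8')

# Hypothetical output location, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'example.json')

# Opening the file with encoding='utf-8' lets json.dump write the
# unescaped characters directly.
with open(path, 'w', encoding='utf-8') as f:
    json.dump(d, f, ensure_ascii=False)

# Read the raw bytes back to confirm what ended up on disk.
with open(path, 'rb') as f:
    on_disk = f.read()
```

The bytes on disk match utf8_bytes, so either route gives the same UTF-8 JSON; no unicode_literals import or coding header is needed in Python 3.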
Xiao answered Oct 14 '22 09:10