I have a Python dictionary which contains items that have non-English characters. When I print the dictionary, the Python shell does not display the non-English characters properly. How can I fix this?
When your application prints hei\xdfen instead of heißen, it means you are printing not the unicode string itself but the string representation of the unicode object.
Let us assume your string ("heißen") is stored in a variable called text. To see what you are dealing with, check the type of this variable by calling:
>>> type(text)
If you get <type 'unicode'>, it means you are not dealing with a byte string (str) but with a unicode object.
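If you do the intuitive thing and evaluate text at the interactive prompt, or print a container such as a dictionary that holds it, you won't get the actual text ("heißen") but the string representation (repr) of the unicode object. Even a plain print text can fail with a UnicodeEncodeError when Python cannot determine your terminal's encoding.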
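To make the difference between the text and its escaped representation concrete, here is a minimal sketch. It is written so that it also runs on Python 3, and it uses the unicode_escape codec as a stand-in for the escaped form that Python 2's repr shows; the variable names are only illustrative.

```python
# The unicode object for "heißen"; \xdf is the code point of ß.
text = u'hei\xdfen'

# The text itself is six characters long: h, e, i, ß, e, n.
length = len(text)

# The escaped spelling is plain ASCII and spells ß out as the
# four characters \, x, d, f. This is what you see as hei\xdfen.
escaped = text.encode('unicode_escape')

# Decoding the escaped spelling gives back the very same text.
round_trip = escaped.decode('unicode_escape')
```

Both spellings describe the same string; only the escaped one is safe to show on any terminal, which is why the representation uses it.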
To fix this, you need to know which encoding your terminal uses and print your unicode object encoded accordingly.
For instance, if your terminal uses UTF-8 encoding, you can print out a string by invoking:
print text.encode('utf-8')
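If you would rather not hard-code the encoding, a slightly more defensive sketch could look it up from sys.stdout. The helper name encode_for_terminal, the UTF-8 fallback and the 'replace' error handling are assumptions made for illustration, not something the interpreter does for you.

```python
import sys


def encode_for_terminal(text, encoding=None):
    """Encode a unicode object for output (hypothetical helper)."""
    if encoding is None:
        # sys.stdout.encoding may be missing or None, e.g. when output
        # is piped; falling back to UTF-8 is an assumption of this sketch.
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # 'replace' substitutes '?' for characters the target encoding lacks
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, 'replace')


text = u'hei\xdfen'  # "heißen"
utf8_bytes = encode_for_terminal(text, 'utf-8')
latin1_bytes = encode_for_terminal(text, 'latin-1')
auto_bytes = encode_for_terminal(text)
```

The same text becomes different bytes depending on the encoding (two bytes for ß in UTF-8, one in Latin-1), which is why the terminal's encoding matters.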
So much for the basic concepts. Now let me give you a more detailed example. Let us assume we have a UTF-8 encoded source code file storing your dictionary, like:
mydict = {'heiße': 'heiße', 'äää': 'ööö'}
When you type print mydict you will get {'\xc3\xa4\xc3\xa4\xc3\xa4': '\xc3\xb6\xc3\xb6\xc3\xb6', 'hei\xc3\x9fe': 'hei\xc3\x9fe'}. Even print mydict['äää'] doesn't work: it results in something like ├Â├Â├Â. The nature of the problem is revealed by trying out print type(mydict['äää']), which will tell you that you are dealing with a str object, that is, a byte string holding the UTF-8 bytes from your source file rather than a unicode object.
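The garbled output can be reproduced with a short sketch. It assumes the terminal uses the cp850 code page, which is a guess that happens to match the garbling shown above; your terminal may use a different one. The code is written so that it also runs on Python 3.

```python
# "ööö" as it is stored in a UTF-8 encoded source file: two bytes per ö.
utf8_bytes = u'\xf6\xf6\xf6'.encode('utf-8')

# A cp850 terminal (an assumption of this sketch) interprets every byte
# as one character of its own code page, so each ö turns into two
# unrelated characters.
garbled = utf8_bytes.decode('cp850')

# Decoding with the encoding the bytes were actually written in
# recovers the intended text.
correct = utf8_bytes.decode('utf-8')
```

Here garbled is the six-character string ├Â├Â├Â, while correct is the three-character string ööö.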
In order to fix the problem, you first need to decode the byte string from your source code file's charset into a unicode object and then encode it in the charset of your terminal. For individual dict items this can be achieved by:
print unicode(mydict['äää'], 'utf-8')
Note that if the default output encoding doesn't match your terminal, you need to encode explicitly:
print unicode(mydict['äää'], 'utf-8').encode('utf-8')
Here the outer encode call specifies the encoding of your terminal.
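To handle the whole dictionary rather than one item at a time, you could decode every key and value up front. The following is only a sketch: decode_dict is a hypothetical helper name, the dictionary is spelled with explicit byte literals so the example does not depend on the source file's encoding, and the code is written so that it also runs on Python 3.

```python
def decode_dict(byte_dict, source_encoding='utf-8'):
    """Return a copy whose keys and values are unicode objects
    (hypothetical helper; assumes all keys and values are byte strings)."""
    return dict(
        (key.decode(source_encoding), value.decode(source_encoding))
        for key, value in byte_dict.items()
    )


# The same content as mydict, written as the raw UTF-8 bytes.
mydict = {
    b'hei\xc3\x9fe': b'hei\xc3\x9fe',
    b'\xc3\xa4\xc3\xa4\xc3\xa4': b'\xc3\xb6\xc3\xb6\xc3\xb6',
}

decoded = decode_dict(mydict)

# One unicode line per item, ready to be encoded for the terminal.
lines = [u'%s: %s' % (key, value) for key, value in sorted(decoded.items())]
```

Each entry in lines can then be printed after encoding it for your terminal, exactly as with a single item.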
I really urge you to read through Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". Unless you understand how character sets work, you will stumble across problems like this one again and again.