
Python UTF-8 comparison

>>> a = {"a":"çö"}
>>> b = "çö"
>>> a['a']
'\xc3\xa7\xc3\xb6'

>>> b.decode('utf-8') == a['a']
False

What is going on here?

Edit: I'm sorry, it was my mistake. It is still False. I'm using Python 2.6 on Ubuntu 10.04.

erkangur asked Aug 03 '10 19:08




1 Answer

Possible solutions

Either write like this:

a = {"a": u"çö"}
b = "çö"
b.decode('utf-8') == a['a']

Or like this (you may also skip the .decode('utf-8') on both sides):

a = {"a": "çö"}
b = "çö"
b.decode('utf-8') == a['a'].decode('utf-8')

Or like this (my recommendation):

a = {"a": u"çö"}
b = u"çö"
b == a['a']
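The recommendation above amounts to comparing text with text. When the values come from somewhere you do not control, one way to get there is to decode any byte strings before comparing. The following is only a minimal sketch: `to_text` is a hypothetical helper (not part of the standard library), and it assumes the byte strings are UTF-8 encoded. It uses escaped literals so that it does not depend on the source file encoding.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; return text values unchanged.

    Hypothetical helper; assumes byte strings are encoded with `encoding`.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A byte string, like a['a'] in the question, and a text string.
a = {"a": u"\u00e7\u00f6".encode("utf-8")}
b = u"\u00e7\u00f6"

# Both sides are text by the time they are compared.
same = to_text(a["a"]) == to_text(b)
print(same)
```

Decoding once at the boundary and working with text everywhere else avoids mixing the two string types in a comparison in the first place.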

Explanation

Updated based on Tim's comment. In your original code, b.decode('utf-8') == u'çö' and a['a'] == 'çö', so you're actually making the following comparison:

u'çö' == 'çö'

One of the objects is of type unicode and the other is of type str, so to perform the comparison Python 2 implicitly converts the str to unicode using the default codec (ASCII) and then compares the two unicode objects. This works fine for purely ASCII strings, for example u'a' == 'a', since unicode('a') == u'a'.

However, it fails in the case of u'çö' == 'çö', because unicode('çö') raises the following error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128). The comparison itself does not raise; it returns False and issues the following warning: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal.
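The failing step can be reproduced by doing the decode explicitly instead of leaving it to the comparison. This is a rough sketch rather than a description of the interpreter's internals: `try_decode` is a hypothetical helper, and the explicit `decode("ascii")` call stands in for the implicit conversion described above.

```python
text = u"\u00e7\u00f6"        # the two characters from the question, as text
raw = text.encode("utf-8")    # the same characters as UTF-8 bytes (0xc3 0xa7 0xc3 0xb6)


def try_decode(data, codec):
    """Return the decoded text, or None if the bytes are not valid for the codec.

    Hypothetical helper used only for this illustration.
    """
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# Decoding as ASCII fails because byte 0xc3 is outside range(128).
ascii_result = try_decode(raw, "ascii")

# Decoding as UTF-8 recovers the original text, so the comparison succeeds.
utf8_result = try_decode(raw, "utf-8")

print(ascii_result is None)
print(utf8_result == text)
```

Pure ASCII input such as b"a" decodes with either codec, which is why the mixed comparison appears to work until a non-ASCII byte shows up.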

Bolo answered Oct 13 '22 04:10