
How do I compare a Unicode string that has different bytes, but the same value?

Tags: python, unicode

I'm comparing Unicode strings between JSON objects.

They have the same value:

    a = '人口じんこうに膾炙かいしゃする'
    b = '人口じんこうに膾炙かいしゃする'

But they have different Unicode representations:

    String a : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b'
    String b : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b'

How can I compare two Unicode strings by their value?

Seunghoon Baek asked Apr 05 '18 00:04


2 Answers

Unicode normalization will get you there for this one:

    >>> import unicodedata
    >>> unicodedata.normalize("NFC", "\uf9fb") == "\u7099"
    True

Use unicodedata.normalize on both of your strings before comparing them with == to check for canonical Unicode equivalence.
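Applied to the two strings from the question, that advice might look like the sketch below. The helper name `canonically_equal` and its `form` parameter are made up for illustration; only `unicodedata.normalize` comes from the standard library.

```python
import unicodedata

def canonically_equal(s1, s2, form="NFC"):
    """Compare two strings after applying the same normalization form to both.

    Hypothetical helper, not part of the standard library.
    """
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)

# The two strings from the question, written with escapes so the
# differing code point (U+7099 vs. U+F9FB) is visible.
a = "\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b"
b = "\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b"

print(a == b)                   # False: the code points differ
print(canonically_equal(a, b))  # True: both normalize to the same sequence
```

Normalizing both sides with the same form matters more than which form is chosen here; mixing forms between the two operands can produce false mismatches.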

Character U+F9FB is a "CJK Compatibility" character. These characters decompose into one or more regular CJK characters when normalized.

Ry- answered Sep 26 '22 15:09


Character U+F9FB (炙) is a CJK Compatibility Ideograph. These characters are distinct code points from the regular CJK characters, but they decompose into one or more regular CJK characters when normalized.
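To see that decomposition directly, the standard `unicodedata` module can report the mapping stored for a code point. This is only an inspection sketch; the variable names are illustrative, and the exact name string printed depends on the Unicode database bundled with your Python build.

```python
import unicodedata

compat = "\uf9fb"   # the compatibility ideograph from string b
unified = "\u7099"  # the regular CJK ideograph from string a

# Official character name of the compatibility code point.
print(unicodedata.name(compat))

# Decomposition mapping, given as hex code points.
print(unicodedata.decomposition(compat))  # '7099'

# Check which normalization forms map the compatibility character
# onto the regular one.
forms = ("NFC", "NFD", "NFKC", "NFKD")
results = {form: unicodedata.normalize(form, compat) == unified for form in forms}
print(results)
```

For this particular character the mapping has no compatibility tag, which is why a canonical form such as NFC is already enough to make the two strings compare equal.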

Unicode has an official string collation algorithm, the Unicode Collation Algorithm (UCA), designed for exactly this purpose. Python does not come with UCA support as of 3.7,* but there are third-party libraries like pyuca:

    >>> from pyuca import Collator
    >>> ck = Collator().sort_key
    >>> ck(a) == ck(b)
    True

For this case (and many others, but definitely not all), picking the appropriate normalization to apply to both strings before comparing will work, and it has the advantage of being supported in the stdlib.
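As a rough illustration of why the choice of normalization form matters, the sketch below compares a canonical form (NFC) with a compatibility form (NFKC). The ligature example is not from the question; it is an assumed test case chosen because its decomposition is compatibility-only, and `equal_under` is a hypothetical helper.

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, a single code point
plain = "fi"         # two ordinary ASCII letters

def equal_under(form, s1, s2):
    """Hypothetical helper: compare after normalizing both strings with `form`."""
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)

# NFC only applies canonical mappings, so the ligature is left alone.
canonical_match = equal_under("NFC", ligature, plain)

# NFKC also applies compatibility mappings, which expand the ligature.
compatibility_match = equal_under("NFKC", ligature, plain)

print(canonical_match, compatibility_match)  # False True
```

Compatibility forms fold more distinctions away, so they suit loose matching such as search, while canonical forms are the more conservative choice when the strings must stay otherwise unchanged.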

* The idea has been accepted in principle since 3.4, but nobody has written an implementation, in part because most of the core devs who care are using pyuca or one of the two ICU bindings, which have the advantage of working in current and older versions of Python.

abarnert answered Sep 26 '22 15:09