I need to compare two strings. aa is extracted from a PDF file (using pdfminer/chardet) and bb is keyboard input. How can I normalize the first string so that the comparison succeeds?
>>> aa = "ā"
>>> bb = "ā"
>>> aa == bb
False
>>>
>>> aa.encode('utf-8')
b'\xc4\x81'
>>> bb.encode('utf-8')
b'a\xcc\x84'
The Unicode standard defines various normalization forms of a Unicode string, based on the definitions of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in more than one way.
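To make the two kinds of equivalence concrete, here is a small sketch using only the standard-library unicodedata module. The variable names and the choice of the "fi" ligature as a compatibility example are mine, picked for illustration; they are not part of the question.

```python
import unicodedata

composed = "\u0101"     # LATIN SMALL LETTER A WITH MACRON, one code point
decomposed = "a\u0304"  # 'a' followed by COMBINING MACRON, two code points

# Canonical equivalence: NFC and NFD map both spellings to a single form.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Compatibility equivalence: NFKC/NFKD additionally fold compatibility
# characters, such as the single-code-point ligature U+FB01, into plain letters.
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI
assert unicodedata.normalize("NFC", ligature) == ligature  # left unchanged
assert unicodedata.normalize("NFKC", ligature) == "fi"     # folded to "f" + "i"
```

NFC and NFD only address the composed/decomposed difference shown in the question; the NFK* forms are lossier and are worth considering only if that extra folding is actually wanted.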
To remove accents from a Python Unicode string altogether, one option is the Unidecode package, which renders any Unicode string into the closest possible representation in ASCII text.
You normalize with unicodedata.normalize:
>>> aa = b'\xc4\x81'.decode('utf8') # composed form
>>> bb = b'a\xcc\x84'.decode('utf8') # decomposed form
>>> aa
'ā'
>>> bb
'ā'
>>> aa == bb
False
>>> import unicodedata as ud
>>> aa == ud.normalize('NFC',bb) # compare composed
True
>>> ud.normalize('NFD',aa) == bb # compare decomposed
True
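If the comparison is needed in more than one place, it can be wrapped in a small helper. This is only a sketch: the function name same_text and its form parameter are hypothetical, and it assumes both inputs are already str objects (not bytes).

```python
import unicodedata

def same_text(left: str, right: str, form: str = "NFC") -> bool:
    """Return True if both strings are equal after Unicode normalization."""
    # Normalize both sides so it does not matter which one was decomposed.
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)

aa = b"\xc4\x81".decode("utf8")   # composed form, as extracted from the PDF
bb = b"a\xcc\x84".decode("utf8")  # decomposed form, as typed on the keyboard

print(aa == bb)           # False
print(same_text(aa, bb))  # True
```

Normalizing both arguments, rather than just the first string, avoids having to know in advance which source produces which form. If the PDF text also contains ligatures or other compatibility characters, passing form="NFKC" may help, but that is a guess about the data rather than something the question confirms.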