
How to "normalize" a Python 3 Unicode string

I need to compare two strings. aa is extracted from a PDF file (using pdfminer/chardet) and bb is keyboard input. How can I normalize the first string so that the comparison succeeds?

>>> aa = "ā"
>>> bb = "ā"
>>> aa == bb
False
>>> 
>>> aa.encode('utf-8')
b'\xc4\x81'
>>> bb.encode('utf-8')
b'a\xcc\x84'
rudensm asked Nov 03 '17 10:11



1 Answer

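You normalize with unicodedata.normalize, which converts a string to a chosen form (for example composed, NFC, or decomposed, NFD) so that both strings use the same representation before you compare them: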

>>> aa = b'\xc4\x81'.decode('utf8')   # composed form
>>> bb = b'a\xcc\x84'.decode('utf8')  # decomposed form
>>> aa
'ā'
>>> bb
'ā'
>>> aa == bb
False
>>> import unicodedata as ud
>>> aa == ud.normalize('NFC', bb)  # compose bb, then compare
True
>>> ud.normalize('NFD', aa) == bb  # decompose aa, then compare
True
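
If you do not know in advance which form each string arrives in, one option is to normalize both sides to the same form before comparing. The following is a minimal sketch, not a definitive implementation: `same_text` and `describe` are hypothetical helper names (they are not part of `unicodedata`), and defaulting to NFC is an assumption; NFD should work equally well as long as both sides use the same form.

```python
import unicodedata as ud

def same_text(left, right, form="NFC"):
    """Return True if both strings are equal after normalizing to `form`."""
    return ud.normalize(form, left) == ud.normalize(form, right)

def describe(text):
    """List the Unicode name of each code point, to see how a string is built."""
    return [ud.name(ch, "UNKNOWN") for ch in text]

composed = "\u0101"      # one code point: LATIN SMALL LETTER A WITH MACRON
decomposed = "a\u0304"   # two code points: 'a' plus COMBINING MACRON

print(same_text(composed, decomposed))  # True
print(describe(composed))
print(describe(decomposed))
```

`describe` is only a debugging aid: printing the code point names makes it visible that the two strings render the same but are stored differently, which is why the plain `==` comparison fails.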
Mark Tolonen answered Sep 21 '22 16:09