I retrieve text-based UTF-8 data from a foreign source which contains special characters such as u"ıöüç", and I want to normalize them to their ASCII equivalents, e.g. "ıöüç" -> "iouc". What would be the best way to achieve this?
The Unicode standard defines various normalization forms of a Unicode string, based on the definitions of canonical equivalence and compatibility equivalence. In Unicode, the same character can often be expressed in several ways.
Character normalization is a process that can improve recall: with normalization, more documents are retrieved even when they do not exactly match the query.
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts which are equivalent are reduced to the same sequence of code points, called the normalization form or normal form of the original text.
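To see this concretely, here is a minimal sketch using Python's standard unicodedata module: the accented letter é can be written either as the single code point U+00E9 or as e followed by U+0301 COMBINING ACUTE ACCENT, and normalization reduces both spellings to the same sequence.
>>> import unicodedata
>>> composed = '\u00e9'      # 'é' as a single code point
>>> decomposed = 'e\u0301'   # 'e' + COMBINING ACUTE ACCENT
>>> composed == decomposed   # different code point sequences
False
>>> unicodedata.normalize('NFC', composed) == unicodedata.normalize('NFC', decomposed)
True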
I recommend using Unidecode module:
>>> from unidecode import unidecode
>>> unidecode(u'ıöüç')
'iouc'
Note how you feed it a Unicode string and it returns a string containing only ASCII characters (a byte string on Python 2, an ordinary str on Python 3). The output is guaranteed to be ASCII.
It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg), then unidecode is the way to go.
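For example (assuming the package is installed, e.g. via pip install Unidecode):
>>> from unidecode import unidecode
>>> unidecode(u'αβγ')
'abg'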
If you just want to remove accents from accented letters, then you could try decomposing your string using normalization form NFKD (this converts the accented letter á to a plain letter a followed by U+0301 COMBINING ACUTE ACCENT) and then discarding the accents (which belong to the Unicode character category Mn, "Mark, nonspacing").
import unicodedata

def remove_nonspacing_marks(s):
    "Decompose the unicode string s and remove non-spacing marks."
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')
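For example, applied to the string from the question:
>>> remove_nonspacing_marks(u'ıöüç')
'ıouc'
Note that ı (LATIN SMALL LETTER DOTLESS I) is left unchanged: it has no decomposition, since it is not an accented letter but a distinct character. That is the kind of case where unidecode goes further than plain decomposition.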