
Python and character normalization

Hello, I retrieve UTF-8 text data from a foreign source that contains special characters such as u"ıöüç", and I want to normalize them to their closest plain English (ASCII) letters, e.g. "ıöüç" -> "iouc". What would be the best way to achieve this?

Hellnar asked Nov 12 '10 07:11



2 Answers

I recommend using Unidecode module:

    >>> from unidecode import unidecode
    >>> unidecode(u'ıöüç')
    'iouc'

Note that you feed it a unicode string and it returns a plain string (a byte string on Python 2, a str on Python 3). The output is guaranteed to be ASCII.

Constantin answered Oct 15 '22 00:10


It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg) then unidecode is the way to go.

If you just want to remove accents from accented letters, you could decompose your string using normalization form NFKD (this converts the accented letter á to a plain letter a followed by U+0301 COMBINING ACUTE ACCENT) and then discard the accents (which belong to the Unicode general category Mn, "Mark, nonspacing").
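To see what the decomposition step produces, here is a small sketch that uses only the standard library. The helper name `describe` is made up for illustration; it lists the code point, general category, and Unicode name of each character after NFKD normalization.

```python
import unicodedata

def describe(s):
    """Return (code point, category, name) for each character of NFKD(s).

    `describe` is a hypothetical helper, not part of any library.
    """
    decomposed = unicodedata.normalize("NFKD", s)
    return [("U+%04X" % ord(c), unicodedata.category(c), unicodedata.name(c))
            for c in decomposed]

# "\u00e1" is á, LATIN SMALL LETTER A WITH ACUTE.
for row in describe("\u00e1"):
    print(row)
```

The single character á comes back as two characters: a base letter in category Ll and a combining mark in category Mn. Filtering on that category is what removes the accent.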

    import unicodedata

    def remove_nonspacing_marks(s):
        "Decompose the unicode string s and remove non-spacing marks."
        return ''.join(c for c in unicodedata.normalize('NFKD', s)
                       if unicodedata.category(c) != 'Mn')
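One caveat for the string in the question: the dotless ı (U+0131) has no Unicode decomposition, so removing nonspacing marks alone leaves it unchanged. Below is a minimal sketch of one way to handle such letters, assuming a small hand-written fallback table is acceptable. The names `FALLBACK` and `strip_accents` are hypothetical, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical fallback table for letters that NFKD does not decompose.
# Extend it with whatever characters appear in your data.
FALLBACK = {
    "\u0131": "i",   # ı LATIN SMALL LETTER DOTLESS I
    "\u00f8": "o",   # ø LATIN SMALL LETTER O WITH STROKE
    "\u0142": "l",   # ł LATIN SMALL LETTER L WITH STROKE
    "\u00df": "ss",  # ß LATIN SMALL LETTER SHARP S
}

def remove_nonspacing_marks(s):
    """Decompose s with NFKD and drop nonspacing marks (category Mn)."""
    return "".join(c for c in unicodedata.normalize("NFKD", s)
                   if unicodedata.category(c) != "Mn")

def strip_accents(s):
    """Remove accents, then map the remaining special letters via FALLBACK."""
    return "".join(FALLBACK.get(c, c) for c in remove_nonspacing_marks(s))

print(remove_nonspacing_marks("ıöüç"))  # the dotless ı survives
print(strip_accents("ıöüç"))            # the fallback table maps it to i
```

Unlike a full transliteration library, this approach leaves non-Latin text such as αβγ untouched, which may or may not be what you want.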
Gareth Rees answered Oct 15 '22 02:10