Is there a way to convert umlauts to their ASCII equivalent? [duplicate]

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. convert the Unicode string to its decomposed normal form (NFD), in which letters and diacritics are separate characters
  2. remove all the characters whose Unicode category is a combining mark (the diacritics).

Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterparts.

MiniQuark asked Nov 24 '22 18:11


1 Answer

Unidecode is a good fit for this. It is a third-party package (not part of the standard library) that transliterates a Unicode string into a close ASCII approximation, without requiring you to write your own character mapping.

Example:

>>> from unidecode import unidecode
>>> unidecode('kožušček')
'kozuscek'
>>> unidecode('北亰')
'Bei Jing '
>>> unidecode('François')
'Francois'
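
For comparison, the two steps described in the question can also be sketched with only the standard library's unicodedata module, which works the same way in Python 3. This is a minimal sketch, not part of Unidecode: the function name strip_accents is made up for illustration, and it assumes that "diacritic" means the Unicode category Mn (nonspacing mark).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop diacritics using only the standard library.

    Hypothetical helper, named here for illustration only.
    """
    # Step 1: canonical decomposition, so 'ç' becomes 'c' + U+0327.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: discard nonspacing combining marks (category "Mn").
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )


print(strip_accents("François"))
print(strip_accents("kožušček"))
```

Unlike Unidecode, this only removes marks that NFD can split off. Characters with no canonical decomposition, such as 'ø' or 'ß', and non-Latin text such as '北亰' pass through unchanged, so the result is not guaranteed to be ASCII.
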
Christian Oudard answered Nov 27 '22 08:11