Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the best way to remove accents (normalize) in a Python unicode string?

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
  2. remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU or is this possible with just the Python standard library? And what about python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.

like image 651
MiniQuark Avatar asked Feb 05 '09 21:02

MiniQuark


People also ask

How do you get rid of accents in Python?

We can remove accents from the string by using a Python module called Unidecode. This module consists of a method that takes a Unicode object or string and returns a string without ascents.


1 Answers

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

accented_string = u'Málaga' # accented_string is of type 'unicode' import unidecode unaccented_string = unidecode.unidecode(accented_string) # unaccented_string contains 'Malaga'and is of type 'str' 
like image 138
Christian Oudard Avatar answered Oct 18 '22 11:10

Christian Oudard