Python regex to convert non-ascii characters in a string to closest ascii equivalents

I'm seeking a simple Python function that takes a string and returns a similar one with all non-ASCII characters converted to their closest ASCII equivalents. For example, diacritics and the like should be dropped. I imagine there must be a fairly canonical way to do this, and there are plenty of related Stack Overflow questions, but I'm not finding a simple answer, so it seemed worth a separate question.

Example input/output:

"Étienne" -> "Etienne"
dreeves asked Sep 30 '10 18:09


2 Answers

Reading this question made me go looking for something better.

https://pypi.python.org/pypi/Unidecode/0.04.1

Unidecode does exactly what you ask for.
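As a rough sketch of how that might look in practice: the package exposes a `unidecode()` function that takes a string and returns an ASCII approximation. The wrapper name `to_ascii` and the standard-library fallback below are assumptions added for illustration, not part of Unidecode itself. The fallback is cruder, since it silently drops any character that has no ASCII decomposition.

```python
import unicodedata

try:
    # Third-party package: pip install Unidecode
    from unidecode import unidecode
except ImportError:
    unidecode = None


def to_ascii(text):
    """Return an ASCII approximation of text (hypothetical helper name)."""
    if unidecode is not None:
        # Unidecode transliterates each character to its closest ASCII form.
        return unidecode(text)
    # Fallback (an assumption, not Unidecode behaviour): decompose accented
    # letters, then discard everything that is still outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Étienne"))
```

With Unidecode installed, characters that have no decomposition (for example "ø") are still transliterated, whereas the fallback path simply removes them.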

Llanilek answered Oct 01 '22 11:10


In Python 3, using the regex module from PyPI:

http://pypi.python.org/pypi/regex

Starting with the string:

>>> s = "Étienne"

Normalise to NFKD, which splits each accented letter into a base letter plus combining marks, and then remove the marks (Unicode category Mn):

>>> import unicodedata
>>> import regex
>>> regex.sub(r"\p{Mn}", "", unicodedata.normalize("NFKD", s))
'Etienne'
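If installing the third-party regex module is not an option, roughly the same result can be had with the standard library alone, by checking each character's Unicode category instead of matching `\p{Mn}`. This is a minimal sketch under that assumption; the function name `strip_diacritics` is made up for illustration.

```python
import unicodedata


def strip_diacritics(text):
    """Remove nonspacing marks (category Mn) after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character except the combining marks left by decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("Étienne"))
```

Note that this only handles characters that decompose into a base letter plus marks. A letter such as "ø" has no decomposition and passes through unchanged, so the output is not guaranteed to be pure ASCII.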
MRAB answered Oct 01 '22 09:10