Python regex to convert non-ascii characters in a string to closest ascii equivalents

Question

I'm seeking a simple Python function that takes a string and returns a similar one with all non-ASCII characters converted to their closest ASCII equivalents. For example, diacritics and the like should be dropped. I imagine there must be a fairly canonical way to do this, and there are plenty of related Stack Overflow questions, but I'm not finding a simple answer, so it seemed worth a separate question.

Example input/output:

"Étienne" -> "Etienne"

Llanilek · Accepted Answer

Reading this question made me go looking for something better.

https://pypi.python.org/pypi/Unidecode/0.04.1

Unidecode does exactly what you ask for: it transliterates Unicode text to its closest ASCII representation.
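A minimal sketch of how that package might be used, assuming it has been installed with `pip install Unidecode`. The wrapper name `to_ascii` is hypothetical and not part of the package; the fallback branch is my own addition (standard library only) for the case where the import fails, and it is cruder than Unidecode because it simply discards characters that have no ASCII decomposition.

```python
import unicodedata

try:
    # Third-party package (assumed installed): pip install Unidecode
    from unidecode import unidecode
except ImportError:
    unidecode = None


def to_ascii(text: str) -> str:
    """Return an ASCII approximation of text (hypothetical helper name)."""
    if unidecode is not None:
        # Unidecode transliterates each character to an ASCII stand-in.
        return unidecode(text)
    # Fallback, not part of Unidecode: decompose accented characters,
    # then drop every code point that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Étienne"))  # Etienne
```

Both branches give the same result for the example in the question, but they can differ on other input: the fallback silently removes characters it cannot decompose, so check the output if that matters for your data.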

MRAB · Answer

In Python 3 and using the regex implementation at PyPI:

http://pypi.python.org/pypi/regex

Starting with the string:

>>> s = "Étienne"

Normalise to NFKD and then remove the diacritics:

>>> import unicodedata
>>> import regex
>>> regex.sub(r"\p{Mn}", "", unicodedata.normalize("NFKD", s))
'Etienne'
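If installing the third-party regex module is not an option, the same idea can be sketched with the standard library alone. This is an illustration rather than part of the answer above: `strip_diacritics` is a hypothetical name, and the check `unicodedata.category(ch) != "Mn"` stands in for the `\p{Mn}` pattern (nonspacing marks).

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after NFKD decomposition (hypothetical helper)."""
    # NFKD splits "É" into "E" followed by U+0301 COMBINING ACUTE ACCENT.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except nonspacing marks (general category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("Étienne"))  # Etienne
```

Note that this only removes combining marks. Characters with no decomposition, such as "ß" or "ø", pass through unchanged, so the result is not guaranteed to be pure ASCII.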

Tags: python, regex, character-encoding, ascii, special-characters

dreeves
