
Regex - match a character and all its diacritic variations (aka accent-insensitive)

I am trying to match a character and all its possible diacritic variations (aka accent-insensitive) with a regular expression. What I could do of course is:

re.match(r"^[eēéěèȅêęëėẹẽĕȇȩę̋ḕḗḙḛḝė̄]$", "é")

but that is not a general solution. If I use unicode categories like \pL I can't reduce the match to a specific character, in this case e.

Felk asked Mar 03 '16 21:03


1 Answer

A workaround to achieve the desired goal is to use the third-party unidecode package (from unidecode import unidecode) to get rid of all diacritics first, and then just match against the regular e:

re.match(r"^e$", unidecode("é"))

Or, in this simplified case:

unidecode("é") == "e"

Another solution, which doesn't depend on the unidecode library, preserves Unicode and gives more control, is to remove the diacritics manually as follows:

Use unicodedata.normalize() to turn your input string into a decomposed normal form, so that composite characters like é get turned into the decomposed sequence e\u0301 (e + COMBINING ACUTE ACCENT). The example uses "NFKD", the compatibility variant of normal form D ("NFD"); both decompose é in the same way.

>>> import unicodedata
>>> text = "Héllô"
>>> text
'Héllô'
>>> normalized = unicodedata.normalize("NFKD", text)
>>> print(ascii(normalized))
'He\u0301llo\u0302'
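
The normalize() call above is passed "NFKD" rather than plain "NFD". As a rough sketch of the difference (the sample characters and the dictionary labels are chosen here for illustration, not taken from the question), NFKD additionally replaces compatibility characters such as ligatures, while both forms split é into the same two code points:

```python
import unicodedata

# Illustrative samples: a precomposed letter and a compatibility ligature.
samples = {
    "e with acute": "\u00e9",  # é as a single code point
    "fi ligature": "\ufb01",   # ﬁ, a compatibility character
}

for label, char in samples.items():
    nfd = unicodedata.normalize("NFD", char)
    nfkd = unicodedata.normalize("NFKD", char)
    # ascii() escapes non-ASCII code points so the decomposition is visible.
    print(f"{label}: NFD={ascii(nfd)} NFKD={ascii(nfkd)}")
```

If ligatures and similar compatibility characters should be left untouched, "NFD" is probably the safer choice; for the accent-stripping goal here either form should work.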

Then, remove all code points which fall into the category Mark, Nonspacing (Mn for short). Those are characters that have no width themselves and just decorate the previous character. Use unicodedata.category() to determine the category.

>>> stripped = "".join(c for c in normalized if unicodedata.category(c) != "Mn")
>>> stripped
'Hello'
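
To see why the filter keeps the base letter and drops the accent, it can help to list the category of each code point. This is only an inspection sketch; describe_code_points is a hypothetical helper name introduced here, not part of unicodedata:

```python
import unicodedata

def describe_code_points(text):
    """Return (character, category, name) for each code point of the NFKD form."""
    normalized = unicodedata.normalize("NFKD", text)
    return [
        (c, unicodedata.category(c), unicodedata.name(c, "<unnamed>"))
        for c in normalized
    ]

# é decomposes into a lowercase letter (Ll) and a non-spacing mark (Mn).
for char, category, name in describe_code_points("\u00e9"):
    print(f"U+{ord(char):04X} {category} {name}")
```

The category-based filter keeps the Ll code point and drops the Mn one.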

The result can be used as a source for regex-matching, just as in the unidecode-example above. Here's the whole thing as a function:

import unicodedata

def remove_diacritics(text):
    """
    Returns a string with all diacritics (aka non-spacing marks) removed.
    For example "Héllô" will become "Hello".
    Useful for comparing strings in an accent-insensitive fashion.
    """
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")
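
A minimal sketch of how the function might be combined with the regex from the question. The wrapper name match_ignoring_diacritics and the list of candidates are assumptions made for this example; the function body is repeated so the snippet runs on its own:

```python
import re
import unicodedata

def remove_diacritics(text):
    # Same approach as above: decompose, then drop category-Mn code points.
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")

def match_ignoring_diacritics(pattern, text):
    """Hypothetical helper: apply an accent-free pattern to the stripped text."""
    return re.match(pattern, remove_diacritics(text)) is not None

# Variations of e should match; other letters and longer strings should not.
for candidate in ["e", "\u00e9", "\u0229", "\u0117\u0304", "a", "\u00e9e"]:
    print(ascii(candidate), match_ignoring_diacritics(r"^e$", candidate))
```

One caveat: letters that Unicode does not decompose, such as ø or ł, contain no combining mark after normalization, so this approach leaves them unchanged.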

Felk answered Sep 28 '22 02:09