Regex - match a character and all its diacritic variations (aka accent-insensitive)

Tags: python, regex, python-3.x, diacritics, accent-insensitive

I am trying to match a character and all its possible diacritic variations (aka accent-insensitive) with a regular expression. What I could do of course is:

re.match(r"^[eēéěèȅêęëėẹẽĕȇȩę̋ḕḗḙḛḝė̄]$", "é")

but that is not a general solution. If I use Unicode categories like \pL, I can't restrict the match to a specific base character, in this case e.

265

asked Mar 03 '16 21:03

Felk

1 Answer

A workaround to achieve the desired goal would be to use unidecode to strip all diacritics first, and then match against the plain e:

re.match(r"^e$", unidecode("é"))

Or in this simplified case

unidecode("é") == "e"

Another solution, which doesn't depend on the unidecode library, preserves Unicode and gives more control, is to remove the diacritics manually as follows:

Use unicodedata.normalize() to turn your input string into a decomposed normal form (NFKD here; plain NFD, normal form D, decomposes accented letters in the same way), so that precomposed characters like é are turned into the decomposed form e\u0301 (e + COMBINING ACUTE ACCENT):

>>> import unicodedata
>>> text = "Héllô"
>>> text
'Héllô'
>>> normalized = unicodedata.normalize("NFKD", text)
>>> print(ascii(normalized))  # ascii() makes the combining marks visible
'He\u0301llo\u0302'
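
The call above uses NFKD rather than plain NFD. As a rough sketch of the difference (the sample strings are only illustrations, not part of the original answer): both forms split é into e plus a combining accent, but NFKD additionally replaces compatibility characters, such as the ligature ﬁ, with their plain equivalents.

```python
import unicodedata

sample = "\u00e9"     # é as a single precomposed code point
ligature = "\ufb01"   # LATIN SMALL LIGATURE FI (illustrative compatibility character)

nfd = unicodedata.normalize("NFD", sample)
nfkd = unicodedata.normalize("NFKD", sample)

# Both forms decompose the accented letter into base letter + combining mark.
print(ascii(nfd))   # 'e\u0301'
print(ascii(nfkd))  # 'e\u0301'

# Only the compatibility form rewrites the ligature as two plain letters.
print(ascii(unicodedata.normalize("NFD", ligature)))   # '\ufb01'
print(ascii(unicodedata.normalize("NFKD", ligature)))  # 'fi'
```

If compatibility characters should be left untouched, NFD is probably the safer choice; for pure accent stripping both behave the same.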

Then remove all code points that fall into the category Mark, Nonspacing (Mn for short). These characters have no width of their own and only decorate the preceding character. Use unicodedata.category() to determine each character's category.

>>> stripped = "".join(c for c in normalized if unicodedata.category(c) != "Mn")
>>> stripped
'Hello'
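
To see why filtering on Mn works, it can help to list the category and name of each code point in the decomposed string. This is only an inspection sketch; the `rows` variable is introduced here for illustration.

```python
import unicodedata

normalized = unicodedata.normalize("NFKD", "Héllô")

# Collect (code point, category, name) for every character of the decomposed string.
rows = [
    (f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.name(c))
    for c in normalized
]

for row in rows:
    print(*row)
```

The two combining accents are the only entries reported as Mn; the letters are Lu or Ll and therefore survive the filter.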

The result can be used as the input for regex matching, just as in the unidecode example above. Here's the whole thing as a function:

import unicodedata


def remove_diacritics(text):
    """
    Returns a string with all diacritics (aka non-spacing marks) removed.
    For example "Héllô" will become "Hello".
    Useful for comparing strings in an accent-insensitive fashion.
    """
    # Decompose first so that each accent becomes a separate combining mark.
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")
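
As a minimal sketch of how this could be applied to the original question, the function can be combined with re.match. The helper `matches_ignoring_accents` and the `candidates` list are hypothetical names made up for this example.

```python
import re
import unicodedata


def remove_diacritics(text):
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")


def matches_ignoring_accents(pattern, text):
    """Hypothetical helper: match `pattern` against `text` with diacritics stripped."""
    return re.match(pattern, remove_diacritics(text)) is not None


candidates = ["e", "é", "ê", "ẽ", "a", "ee"]
matched = [c for c in candidates if matches_ignoring_accents(r"^e$", c)]
print(matched)  # ['e', 'é', 'ê', 'ẽ']
```

Note that the pattern itself has to be written without diacritics, and any match positions refer to the stripped string rather than the original input.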

152

answered Sep 28 '22 02:09

Felk
