Regex for accent insensitive replacement in python

In Python 3, I'd like to be able to use re.sub() in an "accent-insensitive" way, as we can do with the re.I flag for case-insensitive substitution.

It could be something like a re.IGNOREACCENTS flag:

original_text = "¿It's 80°C, I'm drinking a café in a cafe with Chloë。"
accent_regex = r'a café'
re.sub(accent_regex, 'X', original_text, flags=re.IGNOREACCENTS)

This would produce "¿It's 80°C, I'm drinking X in X with Chloë。" (note that "Chloë" keeps its accent), whereas plain re.sub() currently produces "¿It's 80°C, I'm drinking X in a cafe with Chloë。".

I don't think such a flag exists, so what would be the best way to do this? Should I run re.finditer on unidecode'd copies of both original_text and accent_regex, then replace by splitting the original string at the match positions? Or replace every character in accent_regex with a character class of its accented variants, for instance r'[cç][aàâ]f[éèêë]'?

asked Apr 26 '17 12:04

Antoine Dusséaux

1 Answer

unidecode is often mentioned for removing accents in Python, but it does more than that: it also converts '°' to 'deg', which might not be the desired output.

unicodedata seems to have enough functionality to remove accents.
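
To see why unicodedata is enough, here is a small sketch of what NFD normalization and character categories look like for an accented letter and for the degree sign. It only uses standard-library calls; the variable name `decomposed` is just for illustration.

```python
import unicodedata

# NFD splits a precomposed letter into its base letter plus combining marks.
decomposed = unicodedata.normalize('NFD', 'é')
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# Combining accents belong to category 'Mn' (Mark, nonspacing).
print(unicodedata.category(decomposed[1]))
# Mn

# The degree sign has no decomposition and is a symbol (category 'So'),
# so filtering out 'Mn' characters leaves it untouched.
print(unicodedata.normalize('NFD', '°') == '°')
# True
print(unicodedata.category('°'))
# So
```

Dropping every 'Mn' character after NFD normalization therefore removes accents while leaving symbols such as '°' alone.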

With any pattern

This method should work with any pattern and any text.

You can temporarily remove the accents from both the text and regex pattern. The match information from re.finditer() (start and end indices) can be used to modify the original, accented text.

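Note that the matches must be processed in reverse order, so that each replacement doesn't shift the indices of the matches that are still to be processed.

This relies on the unaccented copy having the same length as the original text, which holds when every accented letter in the original is a single precomposed code point (as in NFC-normalized text).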

import re
import unicodedata

original_text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."

accented_pattern = r'a café|François Déporte'

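def remove_accents(s):
    # NFD splits each accented letter into a base letter plus combining marks;
    # dropping the nonspacing marks (category 'Mn') leaves only the base letters.
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')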

print(remove_accents('äöüßéèiìììíàáç'))
# aoußeeiiiiiaac

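# Strip the accents from the pattern too, so both sides are compared unaccented.
pattern = re.compile(remove_accents(accented_pattern))

# Search the unaccented copy of the text, but edit the original.
# Caveat: the indices only line up if each accented letter in the original
# is a single precomposed code point.
modified_text = original_text
matches = list(pattern.finditer(remove_accents(original_text)))

# Replace from right to left so that the remaining indices stay valid.
for match in reversed(matches):
    modified_text = modified_text[:match.start()] + 'X' + modified_text[match.end():]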

print(modified_text)
# I'm drinking a 80° café in X with Chloë, X and X.
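
If the original text might contain decomposed characters (for example 'e' followed by a combining accent), the indices in the unaccented copy no longer line up with the original. A possible workaround, sketched below, is to record which original index each unaccented character came from. The helper names `strip_with_index_map` and `sub_ignoring_accents` are made up for this example, and the replacement is treated as a plain string (no backreferences).

```python
import re
import unicodedata

def strip_with_index_map(text):
    """Return (stripped, index_map), where index_map[i] is the index in
    `text` of the character that produced stripped[i]."""
    stripped = []
    index_map = []
    for i, ch in enumerate(text):
        for c in unicodedata.normalize('NFD', ch):
            if unicodedata.category(c) != 'Mn':
                stripped.append(c)
                index_map.append(i)
    # Sentinel: a match ending at len(stripped) maps to len(text).
    index_map.append(len(text))
    return ''.join(stripped), index_map

def sub_ignoring_accents(accented_pattern, repl, text, flags=0):
    """Replace accent-insensitive matches of the pattern in `text` by `repl`."""
    plain_pattern, _ = strip_with_index_map(accented_pattern)
    plain_text, index_map = strip_with_index_map(text)
    pieces = []
    last = 0
    for m in re.finditer(plain_pattern, plain_text, flags):
        # Translate the match span back to indices in the original text.
        start, end = index_map[m.start()], index_map[m.end()]
        pieces.append(text[last:start])
        pieces.append(repl)
        last = end
    pieces.append(text[last:])
    return ''.join(pieces)

text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."
print(sub_ignoring_accents(r'a café|François Déporte', 'X', text))
# I'm drinking a 80° café in X with Chloë, X and X.
```

Because the end of a match is mapped to the index of the next kept character, any combining marks trailing the last matched letter are replaced along with it.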

If pattern is a word or a set of words

You could:

- remove the accents from your pattern words and store them in a set for fast lookup
- find every word in your text with \w+
- remove the accents from each word found:
  - if the result is in the set, replace the word with X
  - if it isn't, leave the word untouched

import re
from unidecode import unidecode

original_text = "I'm drinking a café in a cafe with Chloë."

def remove_accents(string):
    return unidecode(string)

accented_words = ['café', 'français']

words_to_remove = set(remove_accents(word) for word in accented_words)

def remove_words(matchobj):
    word = matchobj.group(0)
    if remove_accents(word) in words_to_remove:
        return 'X'
    else:
        return word

# Use a raw string so that \w is not treated as a string escape sequence.
print(re.sub(r'\w+', remove_words, original_text))
# I'm drinking a X in a X with Chloë.
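
For completeness, the other idea from the question (expanding each letter of a literal phrase into a character class of its accented variants) could look roughly like the sketch below. The `ACCENTED_VARIANTS` table and the `accent_insensitive_regex` helper are hypothetical and deliberately incomplete, and the approach assumes the text uses precomposed characters; it only handles literal strings, not arbitrary regexes.

```python
import re
import unicodedata

# Hypothetical, deliberately small table of accented variants per base letter.
ACCENTED_VARIANTS = {
    'a': 'àáâãäå',
    'c': 'ç',
    'e': 'èéêë',
    'i': 'ìíîï',
    'n': 'ñ',
    'o': 'òóôõö',
    'u': 'ùúûü',
    'y': 'ýÿ',
}

def remove_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def accent_insensitive_regex(literal):
    """Turn a literal string into a regex that also matches accented variants."""
    parts = []
    for ch in remove_accents(literal):
        variants = ACCENTED_VARIANTS.get(ch.lower(), '')
        if ch.isupper():
            variants = variants.upper()
        if variants:
            # Base letter plus its accented forms, e.g. [eèéêë].
            parts.append('[' + ch + variants + ']')
        else:
            # Everything else is matched literally.
            parts.append(re.escape(ch))
    return ''.join(parts)

original_text = "¿It's 80°C, I'm drinking a café in a cafe with Chloë。"
print(re.sub(accent_insensitive_regex('a café'), 'X', original_text))
# ¿It's 80°C, I'm drinking X in X with Chloë。
```

The advantage is that the original text is matched directly, so no index bookkeeping is needed; the drawback is that the table has to list every variant you want to accept.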

answered Sep 21 '22 09:09

Eric Duminil

Tags:

python

regex

unicode

non-ascii-characters

accent-insensitive