In Python 3, I'd like to be able to use re.sub() in an "accent-insensitive" way, as we can do with the re.I flag for case-insensitive substitution. It could be something like a re.IGNOREACCENTS flag:
original_text = "¿It's 80°C, I'm drinking a café in a cafe with Chloë。"
accent_regex = r'a café'
re.sub(accent_regex, 'X', original_text, flags=re.IGNOREACCENTS)
This would give "¿It's 80°C, I'm drinking X in X with Chloë。" (note that there's still an accent on "Chloë"), whereas the same pattern without such a flag gives "¿It's 80°C, I'm drinking X in a cafe with Chloë。" in real Python.
I think that such a flag doesn't exist. So what would be the best option to do this? Using re.finditer and unidecode on both original_text and accent_regex, and then replacing by splitting the string? Or replacing every character in accent_regex with a class of its accented variants, for instance r'[cç][aàâ]f[éèêë]'?
We can remove accents from a string by using a Python module called Unidecode. This module provides a function, unidecode(), that takes a Unicode string and returns a string without accents.
re.IGNORECASE: This flag allows case-insensitive matching of the regular expression against the given string, i.e. expressions like [A-Z] will match lowercase letters too. It is generally passed as an optional argument to re.compile().
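As a small illustration of why re.IGNORECASE alone does not answer the question, here is a minimal sketch (the sample sentence and the variable names are made up for this example). The flag folds case, but an accented letter and its plain counterpart remain different characters:

```python
import re

text = "a CAFÉ, a Café and a cafe"

# With re.IGNORECASE the pattern 'café' also matches 'CAFÉ' and 'Café',
# but 'é' and 'e' are still distinct, so the unaccented "cafe" is left alone.
result = re.sub(r'café', 'X', text, flags=re.IGNORECASE)
print(result)
# a X, a X and a cafe
```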
unidecode is often mentioned for removing accents in Python, but it also does more than that: it converts '°' to 'deg', which might not be the desired output. unicodedata seems to have enough functionality to remove accents.
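To see why unicodedata leaves a character like '°' alone, it may help to look at the canonical decompositions it works with. This is only a quick sketch using the standard library; the escapes are used so the example does not depend on how the source file is normalized:

```python
import unicodedata

e_acute = '\u00e9'   # é
degree = '\u00b0'    # °

# 'é' decomposes into a base letter (U+0065) plus a combining acute accent (U+0301).
e_decomposition = unicodedata.decomposition(e_acute)
print(e_decomposition)
# 0065 0301

# The degree sign has no decomposition, so NFD normalization leaves it untouched.
degree_decomposition = unicodedata.decomposition(degree)
print(repr(degree_decomposition))
# ''

# After NFD, the accent is a separate character with category 'Mn' (nonspacing mark),
# which is what gets filtered out when removing accents.
decomposed = unicodedata.normalize('NFD', '80' + degree + 'C caf' + e_acute)
last_two_categories = [unicodedata.category(c) for c in decomposed[-2:]]
print(last_two_categories)
# ['Ll', 'Mn']
```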
The following method should work with any pattern and any text.
You can temporarily remove the accents from both the text and the regex pattern. The match information from re.finditer() (start and end indices) can then be used to modify the original, accented text.
Note that the matches must be processed in reverse order, so that each replacement does not shift the indices of the matches still to be handled.
import re
import unicodedata

original_text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."
accented_pattern = r'a café|François Déporte'

def remove_accents(s):
    # Decompose each character (NFD), then drop the combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

print(remove_accents('äöüßéèiìììíàáç'))
# aoußeeiiiiiaac

pattern = re.compile(remove_accents(accented_pattern))
modified_text = original_text
# Match on the accent-free copy. The indices only line up with original_text
# if it uses precomposed (NFC) characters, so that both strings have the same length.
matches = list(pattern.finditer(remove_accents(original_text)))
# Replace from the last match to the first so that earlier indices stay valid.
for match in matches[::-1]:
    modified_text = modified_text[:match.start()] + 'X' + modified_text[match.end():]

print(modified_text)
# I'm drinking a 80° café in X with Chloë, X and X.
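If the text might contain decomposed characters (a base letter followed by a separate combining mark), the accent-free copy is shorter than the original and the indices no longer match. One possible way around this, sketched below, is to record for each kept character where it came from. The names strip_accents_with_map and sub_ignore_accents are hypothetical helpers invented for this sketch, not part of re or unicodedata, and the replacement is treated as a literal string (no backreferences):

```python
import re
import unicodedata

def strip_accents_with_map(text):
    """Return (stripped, index_map), where index_map[i] is the index in
    `text` of the character that produced stripped[i]."""
    stripped, index_map = [], []
    for i, ch in enumerate(text):
        for d in unicodedata.normalize('NFD', ch):
            if unicodedata.category(d) != 'Mn':
                stripped.append(d)
                index_map.append(i)
    # Sentinel so that a match ending at the very end can be mapped too.
    index_map.append(len(text))
    return ''.join(stripped), index_map

def sub_ignore_accents(pattern, repl, text):
    """Replace accent-insensitive matches of `pattern` in `text` with the literal `repl`."""
    stripped_text, index_map = strip_accents_with_map(text)
    stripped_pattern, _ = strip_accents_with_map(pattern)
    pieces, last = [], 0
    # Building the result left to right avoids having to reverse the matches.
    for m in re.finditer(stripped_pattern, stripped_text):
        start, end = index_map[m.start()], index_map[m.end()]
        pieces.append(text[last:start])
        pieces.append(repl)
        last = end
    pieces.append(text[last:])
    return ''.join(pieces)

print(sub_ignore_accents(r'a café', 'X', "¿It's 80°C, I'm drinking a café in a cafe with Chloë。"))
# ¿It's 80°C, I'm drinking X in X with Chloë。
```

Mapping a match end to the index of the next kept character means that a trailing combining mark in the original text is swallowed by the replacement instead of being left dangling.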
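You could match every word with \w+ and replace it with X whenever its accent-free form is in a set of accent-free target words: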
import re
from unidecode import unidecode

original_text = "I'm drinking a café in a cafe with Chloë."

def remove_accents(string):
    return unidecode(string)

accented_words = ['café', 'français']
# Accent-free forms of the words to replace.
words_to_remove = set(remove_accents(word) for word in accented_words)

def remove_words(matchobj):
    # Replace the word only if its accent-free form is in the set.
    word = matchobj.group(0)
    if remove_accents(word) in words_to_remove:
        return 'X'
    else:
        return word

# Raw string, so that \w reaches the regex engine as written.
print(re.sub(r'\w+', remove_words, original_text))
# I'm drinking a X in a X with Chloë.
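The other option raised in the question, expanding each letter of the pattern into a class of its accented variants, could be sketched roughly as follows. The VARIANTS table and the accent_insensitive_pattern helper are assumptions made up for this example: the table is deliberately tiny, covers lowercase letters only, and would need extending for real text. The input is also treated as a literal string rather than as a regex:

```python
import re
import unicodedata

# Hypothetical, incomplete table of accented variants per base letter.
VARIANTS = {
    'a': 'aàâä',
    'c': 'cç',
    'e': 'eéèêë',
    'i': 'iîï',
    'o': 'oôö',
    'u': 'uùûü',
}

def accent_insensitive_pattern(literal):
    """Build a regex that matches `literal` whatever accents the text carries."""
    parts = []
    for ch in literal:
        # The first character of the NFD form is the unaccented base letter.
        base = unicodedata.normalize('NFD', ch)[0]
        if base in VARIANTS:
            parts.append('[' + VARIANTS[base] + ']')
        else:
            parts.append(re.escape(ch))
    return ''.join(parts)

original_text = "I'm drinking a café in a cafe with Chloë."
print(re.sub(accent_insensitive_pattern('a café'), 'X', original_text))
# I'm drinking X in X with Chloë.
```

Compared with stripping accents from both strings, this keeps all indices in the original text, at the cost of maintaining the table of variants by hand.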