I would like to change all accented characters into non-accented characters:
conversion_dict = {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U",
                   "á": "a", "à": "a", "â": "a", "é": "e", "è": "e", "ê": "e",
                   "ú": "u", "ù": "u", "û": "u", "ó": "o", "ò": "o", "ô": "o",
                   "Á": "A", "À": "A", "Â": "A", "É": "E", "È": "E", "Ê": "E",
                   "Ú": "U", "Ù": "U", "Û": "U", "Ó": "O", "Ò": "O", "Ô": "O",
                   "ß": "s"}
Is there a way to do something like "paragraph of text".replace([conversion_dict])?
preferred method using third-party module
A much better alternative than the method below is to use the awesome unidecode module:
>>> import unidecode
>>> somestring = u"äüÊÂ"
>>> unidecode.unidecode(somestring)
'auEA'
built-in, slightly-hazardous method
Since your question suggests you are looking to normalize Unicode characters, there is actually a nice, built-in way to do this:
>>> somestring = u"äüÊÂ"
>>> somestring
u'\xe4\xfc\xca\xc2'
>>> import unicodedata
>>> unicodedata.normalize('NFKD', somestring).encode('ascii', 'ignore')
'auEA'
Check out the documentation for unicodedata.normalize.
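Note, however, that there might be some issues with this: encoding with 'ignore' silently drops any character that has no ASCII decomposition (for example, "ß"). See this post for a nice explanation and some workarounds.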
See also, latin-1-to-ascii for alternatives.
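For readers on Python 3, where every str is already Unicode, the same idea can be sketched as follows. This is a minimal illustration, not the answer's original code: the helper name strip_accents is hypothetical, and the last line shows the hazard of encoding with 'ignore', where a character without a decomposition simply disappears.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks.

    Hypothetical helper for illustration. Characters that have no
    decomposition, such as "ß", are left unchanged rather than dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("äüÊÂ"))    # auEA
print(strip_accents("Straße"))  # Straße ("ß" survives untouched)

# The encode-and-ignore variant drops "ß" entirely:
ascii_only = unicodedata.normalize("NFKD", "Straße").encode("ascii", "ignore").decode("ascii")
print(ascii_only)               # Strae
```

Filtering on unicodedata.combining keeps unmapped characters visible, so they can still be handled by an explicit dictionary afterwards.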
for k, v in conversion_dict.items():
    txt = txt.replace(k, v)
ETA: This isn't "horribly" slow at all. Here's a timer for a toy case: running the loop over a 100,000-character string using a dictionary of 100 single-character mappings, almost none of which occur in the string:
import timeit

# Python 2 code (xrange, print statement).
NUM_REPEATS = 100000
conversion_dict = dict([(chr(i), "C") for i in xrange(100)])
txt = "A" * 100000

def replace(x):
    # One str.replace pass over the string per dictionary entry.
    for k, v in conversion_dict.items():
        x = x.replace(k, v)
    return x

t = timeit.Timer("replace(txt)", setup="from __main__ import replace, txt")
print t.timeit(NUM_REPEATS) / NUM_REPEATS, "sec / call"
On my computer I get the running time
0.0056938188076 sec / call
So that is roughly 6 milliseconds for a 100,000-character string. In real text, more of the mapped characters will be in the string, and this will slow it down, but in almost any reasonable situation the replaced characters will be much rarer than other characters. Still, jterrace's answer is perfect.
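To repeat this kind of measurement on Python 3, a sketch along the following lines could be used. It is an adaptation under stated assumptions, not the original benchmark: the test string uses "z", which is not among the dictionary keys, the repeat count is reduced to 100 to keep the run short, and the function names are hypothetical. No timings are quoted here because they depend on the machine and interpreter.

```python
import timeit

# Toy mapping: chr(0) through chr(99) all map to "C".
conversion_dict = {chr(i): "C" for i in range(100)}
# "z" is chr(122), so no key occurs in the test string.
txt = "z" * 100000

def replace_loop(text):
    """One str.replace pass per dictionary entry, as in the answer above."""
    for k, v in conversion_dict.items():
        text = text.replace(k, v)
    return text

# Single-pass alternative using a precomputed translation table.
TABLE = str.maketrans(conversion_dict)

def replace_translate(text):
    return text.translate(TABLE)

# Both approaches should leave this particular string unchanged.
assert replace_loop(txt) == replace_translate(txt) == txt

NUM_REPEATS = 100
for fn in (replace_loop, replace_translate):
    seconds = timeit.timeit(lambda: fn(txt), number=NUM_REPEATS) / NUM_REPEATS
    print(fn.__name__, seconds, "sec / call")
```

Which of the two is faster can vary with the text and the size of the dictionary, so it is worth measuring on representative input.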