Possible to do a string replace with a dictionary?

Question

I would like to change all accented characters into non-accented characters:

conversion_dict = {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U",
                   "á": "a", "à": "a", "â": "a", "é": "e", "è": "e", "ê": "e",
                   "ú": "u", "ù": "u", "û": "u", "ó": "o", "ò": "o", "ô": "o",
                   "Á": "A", "À": "A", "Â": "A", "É": "E", "È": "E", "Ê": "E",
                   "Ú": "U", "Ù": "U", "Û": "U", "Ó": "O", "Ò": "O", "Ô": "O", "ß": "s"}

Is there a way to do something like "paragraph of text".replace([conversion_dict])?

jterrace · Accepted Answer

preferred method using third-party module

A much better alternative to the method below is to use the awesome unidecode module:

>>> import unidecode
>>> somestring = u"äüÊÂ"
>>> unidecode.unidecode(somestring)
'auEA'

built-in, slightly-hazardous method

Inferring from your question that you are looking to normalize unicode characters, there is actually a nice, built-in way to do this:

>>> somestring = u"äüÊÂ"
>>> somestring
u'\xe4\xfc\xca\xc2'
>>> import unicodedata
>>> unicodedata.normalize('NFKD', somestring).encode('ascii', 'ignore')
'auEA'

Check out the documentation for unicodedata.normalize.

Note, however, that there might be some issues with this. See this post for a nice explanation and some workarounds.
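As an illustration of one such issue (not necessarily the one the linked post covers): characters that have no Unicode decomposition are silently dropped by the 'ignore' error handler. The sketch below assumes Python 3, where the encoded result is bytes and needs decoding back to str; strip_accents is a hypothetical helper name, not part of unicodedata.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD), then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # encode() returns bytes in Python 3, so decode back to str
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Accented letters decompose into a base letter plus a combining mark,
# and only the base letter survives the ASCII encoding.
print(strip_accents("äüÊÂ"))

# "ß" has no decomposition, so it disappears instead of becoming "s" or "ss".
print(strip_accents("Straße"))
```

If losing such characters is not acceptable, an explicit mapping like the conversion_dict in the question (or unidecode) is the safer route.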

See also latin-1-to-ascii for alternatives.

David Robinson · Answer

for k, v in conversion_dict.items():
    txt = txt.replace(k, v)
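For reference, the loop above can be wrapped in a function, and in Python 3 the same single-character mapping can also be applied in one pass with str.translate. This is a minimal sketch assuming Python 3 and a shortened version of the question's dictionary; the helper names are illustrative and not from the original answer.

```python
# Shortened copy of the question's mapping, for illustration only.
conversion_dict = {"ä": "a", "ö": "o", "ü": "u", "é": "e", "ß": "s"}

def replace_with_loop(text, mapping):
    """Apply each entry in turn with str.replace (one pass per key)."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

def replace_with_translate(text, mapping):
    """Single pass: str.maketrans accepts a dict whose keys are single characters."""
    return text.translate(str.maketrans(mapping))

sample = "Das Café ist schön, größer und kühl"
print(replace_with_loop(sample, conversion_dict))
print(replace_with_translate(sample, conversion_dict))
```

Both calls give the same result here. str.translate only works when every key is exactly one character, whereas the loop also handles multi-character keys.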

ETA: This isn't "horribly" slow at all. Here's a timer for a toy case: running the replacement over a 100,000-character string using a dictionary of 100 single-character mappings. Only one of those keys, "A" (chr(65)), occurs in the string; the other 99 passes find nothing to replace:

import timeit

NUM_REPEATS = 100000

# 100 single-character keys, chr(0) through chr(99), all mapped to "C"
conversion_dict = dict((chr(i), "C") for i in range(100))

txt = "A" * 100000

def replace(x):
    # one full pass over the string per dictionary entry
    for k, v in conversion_dict.items():
        x = x.replace(k, v)
    return x

t = timeit.Timer("replace(txt)", setup="from __main__ import replace, txt")
print("%s sec / call" % (t.timeit(NUM_REPEATS) / NUM_REPEATS))

On my computer I get the running time

0.0056938188076 sec / call

So roughly one two-hundredth of a second for a 100,000-character string. In real text, some of the mapped characters will actually be in the string, and this will slow it down, but in almost any reasonable situation the replaced characters will be much rarer than the other characters. Still, jterrace's answer is perfect.
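To check this on your own machine, the measurement can be rerun along these lines. This is only a sketch assuming Python 3.6+; the key range, the test strings, and the repeat count are illustrative choices, picked so that no key occurs in the first string and one character in ten is a key in the second.

```python
import timeit

# 56 keys from the Latin-1 accented range (U+00C0 to U+00F7), all mapped to "C"
mapping = {chr(i): "C" for i in range(0x00C0, 0x00F8)}

plain = "A" * 100_000                     # no key occurs in this string
mixed = ("A" * 9 + "\u00e9") * 10_000     # every tenth character is a key

def replace_all(text, mapping=mapping):
    """Apply each mapping entry in turn with str.replace."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

def time_call(text, repeats=20):
    """Average seconds per call over `repeats` runs."""
    return timeit.timeit(lambda: replace_all(text), number=repeats) / repeats

print(time_call(plain), "sec / call (no matches)")
print(time_call(mixed), "sec / call (10% matches)")
```

Absolute numbers depend on the machine and Python version, so only the comparison between the two lines is meaningful.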

Tags:

python

David542

2 Answers

jterrace

David Robinson
