
Possible to do a string replace with a dictionary?

Tags:

python

I would like to change all accented characters into non-accented characters:

conversion_dict = {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U",
                   "á": "a", "à": "a", "â": "a", "é": "e", "è": "e", "ê": "e",
                   "ú": "u", "ù": "u", "û": "u", "ó": "o", "ò": "o", "ô": "o",
                   "Á": "A", "À": "A", "Â": "A", "É": "E", "È": "E", "Ê": "E",
                   "Ú": "U", "Ù": "U", "Û": "U", "Ó": "O", "Ò": "O", "Ô": "O", "ß": "s"}

Is there a way to do something like "paragraph of text".replace([conversion_dict])?

David542 asked Feb 16 '12 19:02


2 Answers

preferred method using third-party module

A much better alternative to the built-in method below is the third-party unidecode module:

>>> import unidecode
>>> somestring = u"äüÊÂ"
>>> unidecode.unidecode(somestring)
'auEA'

built-in, slightly-hazardous method

Since your question suggests you are looking to normalize Unicode characters, there is actually a nice, built-in way to do this with the standard unicodedata module:

>>> somestring = u"äüÊÂ"
>>> somestring
u'\xe4\xfc\xca\xc2'
>>> import unicodedata
>>> unicodedata.normalize('NFKD', somestring).encode('ascii', 'ignore')
'auEA'

Check out the documentation for unicodedata.normalize.

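Note, however, that this approach has pitfalls: any character that does not decompose into an ASCII base letter plus combining marks (for example "ß") is silently dropped by encode('ascii', 'ignore'). See this post for a nice explanation and some workarounds.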
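To make the caveat concrete, here is a minimal Python 3 sketch of the same normalize-and-encode approach. The function name strip_accents is made up for illustration, and the extra decode step is only there because encode returns bytes in Python 3. It shows one way the method can lose data: "ß" has no decomposition, so it disappears instead of becoming "s".

```python
import unicodedata

def strip_accents(text):
    # NFKD splits "ä" into "a" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops every non-ASCII code point,
    # including the combining marks; decode turns the bytes back into a str.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("äüÊÂ"))    # auEA
print(strip_accents("Straße"))  # Strae ("ß" is dropped, not mapped to "s")
```

If characters such as "ß" matter for your text, an explicit mapping or unidecode is likely the safer choice.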

See also latin-1-to-ascii for alternatives.

jterrace answered Sep 18 '22 19:09


for k, v in conversion_dict.items():
    txt = txt.replace(k, v)
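If you are on Python 3, a closer match to the "replace with a dictionary" idea in the question might be str.maketrans together with str.translate, which applies all single-character replacements in one pass. This is a sketch rather than part of the original answer: the dictionary is abbreviated from the question, replace_accents is a hypothetical helper name, and the one-argument form of str.maketrans assumes every key is a single character.

```python
# Abbreviated copy of the question's mapping (assumption: single-character keys only).
conversion_dict = {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U",
                   "é": "e", "ß": "s"}

# Build the translation table once; it maps code points to replacement strings.
table = str.maketrans(conversion_dict)

def replace_accents(text):
    # Characters that are not in the table pass through unchanged.
    return text.translate(table)

print(replace_accents("Änderung für die Straße"))  # Anderung fur die Strase
```

Unlike the loop above, this cannot be used when a key is longer than one character.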

ETA: This isn't "horribly" slow at all. Here's a timing of a toy case: running the replacement loop over a 100,000-character string with a dictionary that maps 100 characters (chr(0) through chr(99)), of which only "A" actually occurs in the string:

import timeit

NUM_REPEATS = 100000

conversion_dict = dict([(chr(i), "C") for i in xrange(100)])

txt = "A" * 100000

def replace(x):
    # apply each single-character replacement in turn
    for k, v in conversion_dict.items():
        x = x.replace(k, v)
    return x

t = timeit.Timer("replace(txt)", setup="from __main__ import replace, txt")
print t.timeit(NUM_REPEATS) / NUM_REPEATS, "sec / call"

On my computer I get the running time

0.0056938188076 sec / call

So that is roughly one two-hundredth of a second for a 100,000-character string. In real text, some of the mapped characters actually will be in the string, and this will slow it down, but in almost any reasonable situation the replaced characters will be much rarer than other characters. Still, jterrace's answer is perfect.
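For anyone who wants to repeat the measurement on Python 3, the following is a rough sketch, not the benchmark that produced the number above. It assumes a dictionary of 64 Latin-1 characters (U+00C0 through U+00FF), none of which occur in the test string, uses far fewer repeats, and also times a str.translate variant for comparison. The names replace_loop and replace_translate are made up here, and no timings are quoted because they depend on the machine and interpreter.

```python
import timeit

NUM_REPEATS = 20  # far fewer repeats than above, to keep the run short

# Assumption: 64 Latin-1 characters, none of which appear in txt.
conversion_dict = {chr(i): "C" for i in range(0xC0, 0x100)}
table = str.maketrans(conversion_dict)

txt = "A" * 100000

def replace_loop(x):
    # One full pass over the string per dictionary entry.
    for k, v in conversion_dict.items():
        x = x.replace(k, v)
    return x

def replace_translate(x):
    # A single pass using the precomputed translation table.
    return x.translate(table)

for fn in (replace_loop, replace_translate):
    seconds = timeit.timeit(lambda: fn(txt), number=NUM_REPEATS) / NUM_REPEATS
    print(f"{fn.__name__}: {seconds:.6f} sec / call")
```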

David Robinson answered Sep 19 '22 19:09