Mass string replace in python?

Tags: performance, python, string, regex, replace

Say I have a string that looks like this:

    str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

You'll notice a lot of locations in the string where there is an ampersand, followed by a character (such as "&y" and "&c"). I need to replace these characters with an appropriate value that I have in a dictionary, like so:

    dict = {"&y":"\033[0;30m",
            "&c":"\033[0;31m",
            "&b":"\033[0;32m",
            "&Y":"\033[0;33m",
            "&u":"\033[0;34m"}

What is the fastest way to do this? I could manually find all the ampersands, then loop through the dictionary to change them, but that seems slow. Doing a bunch of regex replaces seems slow as well (I will have a dictionary of about 30-40 pairs in my actual code).

Any suggestions are appreciated, thanks.

Edit:

As has been pointed out in comments throughout this question, my dictionary is defined before runtime and will never change during the course of the application's life cycle. It is a list of ANSI escape sequences and will have about 40 items in it. My average string length to compare against will be about 500 characters, but there will be some that are up to 5000 characters (although these will be rare). I am also using Python 2.6 currently.

Edit #2: I accepted Tor Valamo's answer as the correct one, as it not only gave a valid solution (although it wasn't the best solution), but also took all the others into account and did a tremendous amount of work to compare them. That answer is one of the best, most helpful answers I have ever come across on Stack Overflow. Kudos to you.

asked Dec 17 '09 02:12

Mike Trpcic

1 Answer

    mydict = {"&y":"\033[0;30m",
              "&c":"\033[0;31m",
              "&b":"\033[0;32m",
              "&Y":"\033[0;33m",
              "&u":"\033[0;34m"}
    mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

    for k, v in mydict.iteritems():
        mystr = mystr.replace(k, v)

    print mystr

    The ←[0;30mquick ←[0;31mbrown ←[0;32mfox ←[0;33mjumps over the ←[0;34mlazy dog
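For readers on Python 3, the same loop can be sketched as follows. This is only an illustration, assuming `dict.items()` in place of `iteritems()` and `print()` as a function; the helper name `replace_all` is hypothetical and not part of the original answer.

```python
def replace_all(text, mapping):
    """Apply str.replace once per key; every call scans the whole string."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text


mydict = {"&y": "\033[0;30m",
          "&c": "\033[0;31m",
          "&b": "\033[0;32m",
          "&Y": "\033[0;33m",
          "&u": "\033[0;34m"}
mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

result = replace_all(mystr, mydict)
print(repr(result))  # repr() shows the escape characters instead of sending them to the terminal
```

Each replacement runs on the output of the previous one, so a value that contained another key would be rewritten again. The ANSI sequences here contain no ampersand, so that does not happen in this case.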

I took the liberty of comparing a few solutions:

    mydict = dict([('&' + chr(i), str(i)) for i in list(range(65, 91)) + list(range(97, 123))])

    # random inserts between keys
    # (note: as written, the loop appends only the random filler;
    #  the keys in rawstr are never added to mystr)
    from random import randint
    rawstr = ''.join(mydict.keys())
    mystr = ''
    for i in range(0, len(rawstr), 2):
        mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

    from time import time

    # How many times to run each solution
    rep = 10000

    print 'Running %d times with string length %d and ' \
          'random inserts of lengths 0-20' % (rep, len(mystr))

    # My solution
    t = time()
    for x in range(rep):
        for k, v in mydict.items():
            mystr.replace(k, v)
        #print(mystr)
    print '%-30s' % 'Tor fixed & variable dict', time()-t

    from re import sub, compile, escape

    # Peter Hansen
    t = time()
    for x in range(rep):
        sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
    print '%-30s' % 'Peter fixed & variable dict', time()-t

    # Claudiu
    def multiple_replace(dict, text):
        # Create a regular expression from the dictionary keys
        regex = compile("(%s)" % "|".join(map(escape, dict.keys())))

        # For each match, look-up corresponding value in dictionary
        return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

    t = time()
    for x in range(rep):
        multiple_replace(mydict, mystr)
    print '%-30s' % 'Claudio variable dict', time()-t

    # Claudiu - Precompiled
    regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

    t = time()
    for x in range(rep):
        regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
    print '%-30s' % 'Claudio fixed dict', time()-t

    # Andrew Y - variable dict
    def mysubst(somestr, somedict):
        subs = somestr.split("&")
        return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

    t = time()
    for x in range(rep):
        mysubst(mystr, mydict)
    print '%-30s' % 'Andrew Y variable dict', time()-t

    # Andrew Y - fixed
    def repl(s):
        return mydict["&"+s[0:1]] + s[1:]

    t = time()
    for x in range(rep):
        subs = mystr.split("&")
        res = subs[0] + "".join(map(repl, subs[1:]))
    print '%-30s' % 'Andrew Y fixed dict', time()-t

Results in Python 2.6:

    Running 10000 times with string length 490 and random inserts of lengths 0-20
    Tor fixed & variable dict      1.04699993134
    Peter fixed & variable dict    0.218999862671
    Claudio variable dict          2.48400020599
    Claudio fixed dict             0.0940001010895
    Andrew Y variable dict         0.0309998989105
    Andrew Y fixed dict            0.0310001373291
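Since the asker's dictionary never changes at runtime, the precompiled ("fixed dict") regex variant is the one that applies to that case. A minimal Python 3 sketch of it, assuming every key is an ampersand plus one character; the names `COLORS`, `PATTERN`, and `colorize` are illustrative and not from the original answers:

```python
import re

COLORS = {"&y": "\033[0;30m",
          "&c": "\033[0;31m",
          "&b": "\033[0;32m",
          "&Y": "\033[0;33m",
          "&u": "\033[0;34m"}

# Build the alternation once at import time; re.escape guards against
# keys that contain regex metacharacters.
PATTERN = re.compile("|".join(map(re.escape, COLORS)))


def colorize(text):
    """Replace every known key in a single pass over the text."""
    return PATTERN.sub(lambda match: COLORS[match.group(0)], text)


print(repr(colorize("The &yquick &cbrown fox")))
```

Because the pattern only matches keys that exist in the dictionary, a stray ampersand or an unknown code is left in place rather than raising a `KeyError`.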

Both Claudiu's and Andrew's solutions kept reporting a time of 0, so I had to increase the number of runs to 10,000.
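A timing that rounds to 0 usually means the measured loop is shorter than the resolution of the clock being read. As an alternative to raising the repetition count by hand, the standard-library `timeit` module can be used. The sketch below is Python 3 and only illustrative: the function `replace_all`, the test string, and the repeat counts are assumptions, not the setup used for the results in this answer.

```python
import timeit

mydict = {"&y": "\033[0;30m", "&c": "\033[0;31m", "&b": "\033[0;32m",
          "&Y": "\033[0;33m", "&u": "\033[0;34m"}
mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog " * 10


def replace_all(text=mystr, mapping=mydict):
    """Candidate under test: one str.replace call per key."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text


# timeit.repeat returns one total (in seconds) per repeat, each covering
# `number` calls; the minimum is the least disturbed measurement.
timings = timeit.repeat(replace_all, number=1000, repeat=3)
best = min(timings)
print("best of 3: %.6f s per 1000 calls" % best)
```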

I also ran it in Python 3 (because of Unicode), with replacement keys built from the characters with code points 39 through 1023 (38 is the ampersand, so I didn't want to include it). The string length is up to 10,000, including about 980 replacements, with variable random inserts of length 0-20. The Unicode values in that range produce characters of both 1 and 2 bytes in length, which could affect some solutions.
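The 1-byte versus 2-byte split can be checked directly. This sketch assumes the byte lengths meant are those of the UTF-8 encoding (CPython's in-memory representation of `str` is a separate matter); the name `byte_lengths` is illustrative:

```python
from collections import Counter

# Count how many code points in the benchmark's range need 1 or 2 bytes in UTF-8.
byte_lengths = Counter(len(chr(cp).encode("utf-8")) for cp in range(39, 1024))

for size in sorted(byte_lengths):
    print("%d byte(s): %d characters" % (size, byte_lengths[size]))
```

Code points below 128 encode to a single byte in UTF-8, and everything else in this range encodes to two.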

    mydict = dict([('&' + chr(i), str(i)) for i in range(39,1024)])

    # random inserts between keys
    # (note: as written, the loop appends only the random filler;
    #  the keys in rawstr are never added to mystr)
    from random import randint
    rawstr = ''.join(mydict.keys())
    mystr = ''
    for i in range(0, len(rawstr), 2):
        mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

    from time import time

    # How many times to run each solution
    rep = 10000

    print('Running %d times with string length %d and ' \
          'random inserts of lengths 0-20' % (rep, len(mystr)))

    # Tor Valamo - too long
    #t = time()
    #for x in range(rep):
    #    for k, v in mydict.items():
    #        mystr.replace(k, v)
    #print('%-30s' % 'Tor fixed & variable dict', time()-t)

    from re import sub, compile, escape

    # Peter Hansen
    t = time()
    for x in range(rep):
        sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
    print('%-30s' % 'Peter fixed & variable dict', time()-t)

    # Peter 2
    def dictsub(m):
        return mydict[m.group()]

    t = time()
    for x in range(rep):
        sub(r'(&[a-zA-Z])', dictsub, mystr)
    print('%-30s' % 'Peter fixed dict', time()-t)

    # Claudiu - too long
    #def multiple_replace(dict, text):
    #    # Create a regular expression from the dictionary keys
    #    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))
    #
    #    # For each match, look-up corresponding value in dictionary
    #    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
    #
    #t = time()
    #for x in range(rep):
    #    multiple_replace(mydict, mystr)
    #print('%-30s' % 'Claudio variable dict', time()-t)

    # Claudiu - Precompiled
    regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

    t = time()
    for x in range(rep):
        regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
    print('%-30s' % 'Claudio fixed dict', time()-t)

    # Separate setup for Andrew and gnibbler optimized dict
    # (keys are stripped of the leading '&')
    mydict = dict((k[1], v) for k, v in mydict.items())

    # Andrew Y - variable dict
    def mysubst(somestr, somedict):
        subs = somestr.split("&")
        return subs[0] + "".join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

    # Variant: uses subs[0] as the join separator, so its output is not
    # equivalent to mysubst when the string contains replacements
    def mysubst2(somestr, somedict):
        subs = somestr.split("&")
        return subs[0].join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

    t = time()
    for x in range(rep):
        mysubst(mystr, mydict)
    print('%-30s' % 'Andrew Y variable dict', time()-t)
    t = time()
    for x in range(rep):
        mysubst2(mystr, mydict)
    print('%-30s' % 'Andrew Y variable dict 2', time()-t)

    # Andrew Y - fixed
    def repl(s):
        return mydict[s[0:1]] + s[1:]

    t = time()
    for x in range(rep):
        subs = mystr.split("&")
        res = subs[0] + "".join(map(repl, subs[1:]))
    print('%-30s' % 'Andrew Y fixed dict', time()-t)

    # gnibbler
    t = time()
    for x in range(rep):
        myparts = mystr.split("&")
        myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
        "".join(myparts)
    print('%-30s' % 'gnibbler fixed & variable dict', time()-t)

Results:

    Running 10000 times with string length 9491 and random inserts of lengths 0-20
    Tor fixed & variable dict      0.0 # disqualified 329 secs
    Peter fixed & variable dict    2.07799983025
    Peter fixed dict               1.53100013733
    Claudio variable dict          0.0 # disqualified, 37 secs
    Claudio fixed dict             1.5
    Andrew Y variable dict         0.578000068665
    Andrew Y variable dict 2       0.56299996376
    Andrew Y fixed dict            0.56200003624
    gnibbler fixed & variable dict 0.530999898911

(Note that gnibbler's code uses a different dict, in which the keys don't include the '&'. Andrew's code also uses this alternate dict in the Python 3 run, but it didn't make much of a difference, maybe just a 0.01x speedup.)
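To make the split-based idea concrete, here is a minimal Python 3 sketch applied to the asker's example, using a dict whose keys omit the '&'. Unlike the benchmarked versions, it leaves unknown codes and a trailing '&' untouched instead of raising an error; that behavior, and the names `codes` and `split_replace`, are assumptions added for illustration.

```python
codes = {"y": "\033[0;30m",
         "c": "\033[0;31m",
         "b": "\033[0;32m",
         "Y": "\033[0;33m",
         "u": "\033[0;34m"}


def split_replace(text, table):
    """Split on '&', then swap the first character of each later part."""
    parts = text.split("&")
    out = [parts[0]]  # text before the first '&' is copied unchanged
    for part in parts[1:]:
        key = part[:1]  # empty string when '&' is the last character
        if key in table:
            out.append(table[key] + part[1:])
        else:
            out.append("&" + part)  # assumption: keep unknown codes as-is
    return "".join(out)


print(repr(split_replace("The &yquick &cbrown fox", codes)))
```

The work is one `split`, one dictionary lookup per ampersand, and one `join`, which is in line with this family of solutions posting the lowest times above.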

answered Oct 06 '22 21:10

10 revs
