Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently carry out multiple string replacements in Python

If I would like to carry out multiple string replacements, what is the most efficient way to carry this out?

An example of the kind of situation I have encountered in my travels is as follows:

>>> strings = ['a', 'list', 'of', 'strings']
>>> [s.replace('a', '')...replace('u', '')  for s in strings if len(s) > 2]
['a', 'lst', 'of', 'strngs']
like image 845
Tim McNamara Avatar asked Jul 29 '10 23:07

Tim McNamara


2 Answers

The specific example you give (deleting single characters) is perfect for the translate method of strings, as is substitution of single characters with single characters. If the input string is a Unicode one, then, as well as the two above kinds of "substitution", substitution of single characters with multiple character strings is also fine with the translate method (not if you need to work on byte strings, though).

If you need to replace substrings of multiple characters, then I would also recommend using a regular expression -- though not in the way @gnibbler's answer recommends; rather, I'd build the regex from r'onestring|another|yetanother|orthis' (join the substrings you want to replace with vertical bars -- be sure to also re.escape them if they contain special characters, of course) and write a simple substituting-function based on a dict.

I'm not going to offer a lot of code at this time since I don't know which of the two paragraphs applies to your actual needs, but (when I later come back home and check SO again;-) I'll be glad to edit to add a code example as necessary depending on your edits to your question (more useful than comments to this answer;-).

Edit: in a comment the OP says he wants a "more general" answer (without clarifying what that means) then in an edit of his Q he says he wants to study the "tradeoffs" between various snippets all of which use single-character substrings (and check presence thereof, rather than replacing as originally requested -- completely different semantics, of course).

Given this utter and complete confusion all I can say is that to "check tradeoffs" (performance-wise) I like to use python -mtimeit -s'setup things here' 'statements to check' (making sure the statements to check have no side effects to avoid distorting the time measurements, since timeit implicitly loops to provide accurate timing measurements).

A general answer (without any tradeoffs, and involving multiple-character substrings, so completely contrary to his Q's edit but consonant to his comments -- the two being entirely contradictory it is of course impossible to meet both):

import re

class Replacer(object):

  def __init__(self, **replacements):
    self.replacements = replacements
    self.locator = re.compile('|'.join(re.escape(s) for s in replacements))

  def _doreplace(self, mo):
    return self.replacements[mo.group()]

  def replace(self, s):
    return self.locator.sub(self._doreplace, s)

Example use:

r = Replacer(zap='zop', zip='zup')
print r.replace('allazapollezipzapzippopzip')

If some of the substrings to be replaced are Python keywords, they need to be passed in a tad less directly, e.g., the following:

r = Replacer(abc='xyz', def='yyt', ghi='zzq')

would fail because def is a keyword, so you need e.g.:

r = Replacer(abc='xyz', ghi='zzq', **{'def': 'yyt'})

or the like.

I find this a good use for a class (rather than procedural programming) because the RE to locate the substrings to replace, the dict expressing what to replace them with, and the method performing the replacement, really cry out to be "kept all together", and a class instance is just the right way to perform such a "keeping together" in Python. A closure factory would also work (since the replace method is really the only part of the instance that needs to be visible "outside") but in a possibly less-clear, harder to debug way:

def make_replacer(**replacements):
  locator = re.compile('|'.join(re.escape(s) for s in replacements))

  def _doreplace(mo):
    return replacements[mo.group()]

  def replace(s):
    return locator.sub(_doreplace, s)

  return replace

r = make_replacer(zap='zop', zip='zup')
print r('allazapollezipzapzippopzip')

The only real advantage might be a very modestly better performance (needs to be checked with timeit on "benchmark cases" considered significant and representative for the app using it) as the access to the "free variables" (replacements, locator, _doreplace) in this case might be minutely faster than access to the qualified names (self.replacements etc) in the normal, class-based approach (whether this is the case will depend on the Python implementation in use, whence the need to check with timeit on significant benchmarks!).

like image 62
Alex Martelli Avatar answered Nov 11 '22 21:11

Alex Martelli


You may find that it is faster to create a regex and do all the replacements at once.

Also a good idea to move the replacement code out to a function so that you can memoize if you are likely to have duplicates in the list

>>> import re
>>> [re.sub('[aeiou]','',s) for s in strings if len(s) > 2]
['a', 'lst', 'of', 'strngs']


>>> def replacer(s, memo={}):
...   if s not in memo:
...     memo[s] = re.sub('[aeiou]','',s)
...   return memo[s]
... 
>>> [replacer(s) for s in strings if len(s) > 2]
['a', 'lst', 'of', 'strngs']
like image 45
John La Rooy Avatar answered Nov 11 '22 21:11

John La Rooy